Comments
rock333 wrote: At the IaaS Cloud layer virtualisation is going to be essential to allow the self service attributes, all painful and slow to do with physical hardware. Moving up the stack to PaaS and SaaS the use of virtualisation may, as you say, be less required if you put lots of smarts into your software. A lot of software does not have those smarts and by utalising virtualisation of the layers below can manipulate existing software architectures to have more cloudy attributes through automation (eg run load balancers and deploy more servers automagically). Over time, as new investment in software at...
Cloud Expo on Google News


2008 West
DIAMOND SPONSOR:
Data Direct
SOA, WOA and Cloud Computing: The New Frontier for Data Services
PLATINUM SPONSORS:
Red Hat
The Opening of Virtualization
GOLD SPONSORS:
Appsense
User Environment Management – The Third Layer of the Desktop
Cordys
Cloud Computing for Business Agility
EMC
CMIS: A Multi-Vendor Proposal for a Service-Based Content Management Interoperability Standard
Freedom OSS
Practical SOA” Max Yankelevich
Intel
Architecting an Enterprise Service Router (ESR) – A Cost-Effective Way to Scale SOA Across the Enterprise
Sensedia
Return on Assests: Bringing Visibility to your SOA Strategy
Symantec
Managing Hybrid Endpoint Environments
VMWare
Game-Changing Technology for Enterprise Clouds and Applications
Click For 2008 West
Event Webcasts

2008 West
PLATINUM SPONSORS:
Appcelerator
Get ‘Rich’ Quick: Rapid Prototyping for RIA with ZERO Server Code
Keynote Systems
Designing for and Managing Performance in the New Frontier of Rich Internet Applications
GOLD SPONSORS:
ICEsoft
How Can AJAX Improve Homeland Security?
Isomorphic
Beyond Widgets: What a RIA Platform Should Offer
Oracle
REAs: Rich Enterprise Applications
Click For 2008 Event Webcasts
SYS-CON.TV
Top Links You Must Click On


Achieving Higher Throughput with SQL Anywhere
Focus on the performance requirement

Six years ago I was asked to implement a data repository to hold network management and performance data. With monitoring systems, databases have a tendency to be large, and have a delicate balance between inserting (logging) new data and deleting old (aging) data out from the system. The SQL Anywhere server database from iAnywhere was what the customer wanted to use, so I first had to evaluate if it could handle the expected load.

It was initially estimated that the solution would need to support 20 to 30 gigabyte databases, where data collection was ongoing and older data was deleted on a rolling window of 30 days. No problem - one gigabyte of data a day in, one gigabyte of data a day out. I easily validated that SQL Anywhere could handle the load. Over the next year, performance requirements doubled, then redoubled again. Currently, I now support many 100-plus gigabyte databases, and have even exceeded 170 gigabytes in a single database server using SQL Anywhere 7.0.4 and 8.0.2. For the purpose of this article, I've been running all tests on version 9.0.2 of SQL Anywhere.

In high throughput systems, the database schema will consist of many types of tables, ranging from configuration data, user data, and tables storing the high throughput data. These tables are usually tracking rapidly changing information over time, such as statistics, usage measurements, and changing conditions. Temperature changes from a few thousand sensors, fluid flow information on a gas pipeline, or network traffic statistics come to mind. This article concentrates on data throughput for inserts and deletes, and how to get the most out of your database server.

While there are many different variables that impact throughput performance, this paper focuses on database schema design and tuning the database server with respect to these high throughput tables. This article does not address performance issues related to application architecture or hardware selection.

Database Design - Focus on Performance
When high throughput is required, every decision needs to be thought of with respect to how it will affect performance. Databases are generally very fast at inserting and deleting data; however, as soon as you move away from this core functionality and start using all the great "features" of relational dtabases, you can easily hamper performance.

In other words, keep these high-performance tables simple. Some good guidelines include:

  • Keep table width to a minimum: If you want to be fast, you can't have everything. This will allow you to fit more records onto a database page and use fewer pages overall. Fewer pages means faster throughput.
  • Use surrogate primary keys with autoincrement: This allows for faster indexing, and in cases where there is a rolling window of time-stamped data, your deletes can be keybased rather than time based, or worse yet multi-field based.
  • Triggers should be avoided at all cost: Calling a stored procedure when inserting one record every minute or so is not very expensive. We're shooting for inserting hundreds of rows or more per second.
  • Avoid constraints that can be moved into middleware: These include not null constraints, default values, and boundary constraints.
In systems where you are constantly collecting data, the more you can limit the access of the information, the higher the rate of performance. The savings here are from using a limited set of indexes and foreign keys and educating developers on what data is available and how to access it. If you need only two ways of accessing the dataset and you are supporting seven indexes on a table just in case someone accesses it, you are unnecessarily hurting throughput. The removal of Foreign Key constraints is dangerous, but if you are including the constraint for "completeness" and never actually using the constraint, then it may be a candidate for deletion.

Indexing
Application developers and the database designer responsible for maintaining high performance are usually at odds when it comes to indexing. The number and size of indexes directly affect the insertion rate of a table. When the need for massive throughput exists, the bare minimum of indexes should be used. I have gone as far as to restrict areas of application design in order to maintain a minimum set of indexes. This is a tricky trade-off since adding an application feature that represents 1% of the application could affect 50% of the application when it comes to performance. In the worst case it can make the entire application unusable.

A Throughput Example
The example schema below is valid, but contains several of the constructs that can hinder throughput, such as no autoincrement key, column constraints, extra indexes, and extra fields. I will use this to run performance tests of 100,000 records and make corrections on each test run. Each test changes only one aspect of the given design, and the final test applies all the changes. Tests were run an IBM desktop running Windows 2000 with a single Pentium 4 processor and 1.5 Gigabytes of RAM. The disk is a Seagate Barracuda 7200rpm IDE.

Results are given in elapsed time for each test. The key is the difference between each test and the baseline test, and not the elapsed time. As with anything, the actual elapsed time can be impacted by a wide range of variables, such as hardware, software, application design and, of course, what we're most interested in, schema design. See Listing 1.

SQL Anywhere Options
There are a few options that can be set in the server to further improve throughput, however, you need to consider the importance of your data. Since high throughput systems tend to have fairly benign data, you can be a little more cavalier in commit/rollback settings.

Setting the options Delayed_Commits to "On" and Cooperative_Commit to "Off" allows you to streamline throughput to the database. By changing these settings, you are allowing the application to continue processing instead of waiting to find out that the data absolutely got committed. However, you could lose data in the event of a system or hardware failure. Since high throughput data is often expendable, and any catastrophic event like a system or hardware failure typically means I'm going to lose data anyway, I typically change these for the database connections responsible for high-performance tables.

Rerunning Test 7 with Delayed_Commits to "On" and Cooperative_Commit to "Off," the elapsed time drops from 95.5 seconds to 60 seconds, representing an additional 59% gain.

Some Physical Considerations
Most performance problems usually lie in the application design, SQL, and database design. However, the physical choices you make can have a dramatic affect on performance. This area is deserving of a white paper, but a few basic tips include:

  • A fancy $100,000 server with slow disk throughput can be beaten in performance tests by a $500 system with a $100 IDE disk drive. A big Sun workstation ships with internal disks that typically run at 40Mb per second throughput. For a few hundred dollars, I can easily get a SCSI setup running at 160Mb or 320Mb per second on a cheap personal computer.
  • Most disks are physically formatted at 4096 block sizes - so use a 4096 database page size. Running Test 7 on a database using a 1024 page size, the elapsed time increased from 95.5 seconds to 100 seconds, which is a 5% decrease in performance.
  • The faster the disks RPM rate, the faster you can write data.
  • Only use hardware RAID controllers. Software RAID performs horribly.
  • In my experience, SQL Anywhere on Windows runs faster than Linux, which runs faster than Solaris.
  • Memory is cheap, so buy a lot. This will not directly help insert and delete rates, but when you start querying, especially with "order by" or "group by" clauses, you can easily force the database server to use Temp space. Temp space is a file on the disk and we want all disk usage to be for inserting and deleting and not to assist selects.
Backup, Transaction Logs, and Recovery
Recovery can become a major issue with large databases when not done properly. A power outage causing a restart of the server, for example, will automatically cause a SQL Anywhere server to go into an automatic recovery. In certain instances, I've seen the recovery process go on for several days in 7.0.4 and 8.0.2. The database can become inaccessible. Some users have gone as far as to find the dbsrv process and kill it manually. Let's think about this. The database was killed by accident, which is causing it to repair itself, and their solution was to kill it again, just to make sure that it stands even less chance of automatically recovering? Doesn't make a whole lot of sense.
About Todd Loomis
Todd Loomis is an independent consultant who specializes in database design and development frameworks. He has designed systems in many areas to include the aerospace, legal, financial, energy, and computer networking industries. Todd has been developing and supporting products using SQL Anywhere as an embedded database for 6 years. A performance-testing tool is available for free at his Website.

In order to post a comment you need to be registered and logged in.

Register | Sign-in

Reader Feedback: Page 1 of 1

Enterprise Open Source Magazine Latest Stories . . .
Integrated Windows Authentication (IWA) provides a user-friendly interface for single sign-on. IWA uses ‘Simple and Protected GSSAPI Negotiation Mechanism’ (SPNEGO) to allow the initiators and acceptors to negotiate the underlying protocol to be used for authentication. In this article...
Preternaturally quiet since a hedge fund offered to buy it two weeks ago and take it private, Novell stated on Wednesday that the open source Ingres database is available in the free SUSE Studio as part of the SUSE Appliance Program. Novell and Ingres are supposed to jointly support an...
Cloud Computing Journal caught up with the CEO of a major new player in the fast-emerging Cloud ecosystem - a CEO who has taken an interesting and unusual decision. While signing up as the Platinum Plus Sponsor of the 5th International Cloud Expo, he and his company have decided to rem...
Open-Xchange, a provider of business-class open source collaboration software, today announced enhancements that give users telephone and fax integrated with e-mail, contacts, calendar and task information. By combining Open-Xchange (hosted and on-premise editions) with Unified Commun...
Home Energy monitoring products maker People Power has come out with an open source hardware and software application developer kit called SuRF that lets embedded systems developers build energy saving apps for household electronics and devices on top of its Open Source Home Area Netwo...
Novell and Ingres Corporation on Wednesday announced the Ingres database is available within SUSE Studio as part of the SUSE Appliance Program. Both companies have entered into a cooperative agreement to make it easier and more cost-effective for independent software vendors (ISVs) and...
Subscribe to the World's Most Powerful Newsletters
Subscribe to Our Rss Feeds & Get Your SYS-CON News Live!
Click to Add our RSS Feeds to the Service of Your Choice:
Google Reader or Homepage Add to My Yahoo! Subscribe with Bloglines Subscribe in NewsGator Online
myFeedster Add to My AOL Subscribe in Rojo Add 'Hugg' to Newsburst from CNET News.com Kinja Digest View Additional SYS-CON Feeds
Publish Your Article! Please send it to editorial(at)sys-con.com!

Advertise on this site! Contact advertising(at)sys-con.com! 201 802-3021


SYS-CON Featured Whitepapers
ADS BY GOOGLE