Using HammerDB in database failure and failover scenarios

A frequently asked question is how to use the HammerDB TPROC-C workload to test database failure and failover scenarios, by deliberately killing connections or shutting down the database during a workload and then restarting it. This post provides a guide to configuring HammerDB for such a scenario and the considerations to be aware of when doing so.

Is a TPROC-C workload valid if you have restarted the database?

This is an important consideration that is often missed, so although it seems obvious, it is worth reiterating that benchmarking applications are not typically designed to be disconnected and the database restarted whilst the workload is running. Of course, you can configure your test environment however you wish, and HammerDB is scripted for exactly this flexibility. However, your results will not be valid or comparable to other results. Additionally, as HammerDB is designed as a benchmarking application, there are no plans to extend functionality to automatically recover in-flight tests from database failures. If the database ‘goes away’ during a test, then the test is invalid and the database should be restarted and the benchmark re-run in its entirety.

Nevertheless, HammerDB does provide additional functionality both to make multiple connections to test environments and to run test scripts multiple times.

Connect pool functionality for clusters

Firstly, when testing clusters, it is possible to configure a pool of connections for HammerDB to use to connect to multiple instances at the same time. You can also configure multiple databases in the same instance to be tested simultaneously. In the example below, we have configured this connect pool for PostgreSQL to connect to 3 different databases and round-robin the connections between all of them. Note that you can also direct each stored procedure to a different instance, such as read-write vs read-only; an adapted example follows the configuration below.

HammerDB-4.11/config/connectpool$ more pgcpool.xml
<connpool>
<connections>
<c1>
<pg_host>localhost</pg_host>
<pg_port>5432</pg_port>
<pg_sslmode>prefer</pg_sslmode>
<pg_user>tpcc1</pg_user>
<pg_pass>tpcc1</pg_pass>
<pg_dbase>tpcc1</pg_dbase>
</c1>
<c2>
<pg_host>localhost</pg_host>
<pg_port>5432</pg_port>
<pg_sslmode>prefer</pg_sslmode>
<pg_user>tpcc2</pg_user>
<pg_pass>tpcc2</pg_pass>
<pg_dbase>tpcc2</pg_dbase>
</c2>
<c3>
<pg_host>localhost</pg_host>
<pg_port>5432</pg_port>
<pg_sslmode>prefer</pg_sslmode>
<pg_user>tpcc3</pg_user>
<pg_pass>tpcc3</pg_pass>
<pg_dbase>tpcc3</pg_dbase>
</c3>
</connections>
<sprocs>
<neworder>
<connections>c1 c2 c3</connections>
<policy>round_robin</policy>
</neworder>
<payment>
<connections>c1 c2 c3</connections>
<policy>round_robin</policy>
</payment>
<delivery>
<connections>c1 c2 c3</connections>
<policy>round_robin</policy>
</delivery>
<stocklevel>
<connections>c1 c2 c3</connections>
<policy>round_robin</policy>
</stocklevel>
<orderstatus>
<connections>c1 c2 c3</connections>
<policy>round_robin</policy>
</orderstatus>
</sprocs>
</connpool>
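
For example, to direct the read-only stocklevel and orderstatus transactions to connection c3 only, while the read-write transactions round-robin across c1 and c2, the sprocs section can be adapted as follows (a sketch reusing only the elements shown above):

<sprocs>
<neworder>
<connections>c1 c2</connections>
<policy>round_robin</policy>
</neworder>
<payment>
<connections>c1 c2</connections>
<policy>round_robin</policy>
</payment>
<delivery>
<connections>c1 c2</connections>
<policy>round_robin</policy>
</delivery>
<stocklevel>
<connections>c3</connections>
<policy>round_robin</policy>
</stocklevel>
<orderstatus>
<connections>c3</connections>
<policy>round_robin</policy>
</orderstatus>
</sprocs>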

To use this XML configuration, select XML Connect Pool in the connection options.
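
In the CLI, the connect pool can be enabled in a script; a minimal sketch, assuming the pg_connect_pool connection parameter name (check the output of print dict for the exact name in your release):

dbset db pg
dbset bm TPC-C
# assumed parameter enabling the pool defined in config/connectpool/pgcpool.xml
diset connection pg_connect_pool true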

See this post for more details on configuration.  When we run the workload, we now make a connection to all the configured databases and run the transactions according to the defined policy.

If we look at the configured connections when the workload has started, we can see the multiple connections across the defined databases. Note that in this example the main connection for measuring the NOPM/TPM values connects to the database tpcc1.

00:00:00 postgres: tpcc1 tpcc1 127.0.0.1(52450) idle
00:00:07 postgres: tpcc1 tpcc1 127.0.0.1(34704) idle
00:00:07 postgres: tpcc2 tpcc2 127.0.0.1(34710) SELECT
00:00:07 postgres: tpcc3 tpcc3 127.0.0.1(34722) idle
00:00:00 postgres: tpcc1 tpcc1 127.0.0.1(34738) idle
00:00:07 postgres: tpcc1 tpcc1 127.0.0.1(34748) idle
00:00:07 postgres: tpcc2 tpcc2 127.0.0.1(34758) idle
00:00:07 postgres: tpcc3 tpcc3 127.0.0.1(34770) SELECT
00:00:00 postgres: tpcc1 tpcc1 127.0.0.1(34786) idle
00:00:07 postgres: tpcc1 tpcc1 127.0.0.1(34788) idle
00:00:07 postgres: tpcc2 tpcc2 127.0.0.1(34792) idle
00:00:07 postgres: tpcc3 tpcc3 127.0.0.1(34804) SELECT
00:00:00 postgres: tpcc1 tpcc1 127.0.0.1(34820) idle
00:00:07 postgres: tpcc1 tpcc1 127.0.0.1(34822) SELECT
00:00:07 postgres: tpcc2 tpcc2 127.0.0.1(34826) idle
00:00:07 postgres: tpcc3 tpcc3 127.0.0.1(34836) idle
00:00:00 postgres: tpcc1 tpcc1 127.0.0.1(34850) idle

When the test has finished, the NOPM/TPM values are reported from the main connection.

Some databases will report NOPM from across the whole cluster, but some only per database; therefore, this feature also reports the client-side TPM so that the values can be monitored per connection.

"4 Active Virtual Users configured",
"1",
"TEST RESULT : System achieved 3591 NOPM from 24731 PostgreSQL TPM",
"2",
"VU2 processed neworder 18343 payment 18127 delivery 1823 stocklevel 1853 orderstatus 1838 transactions",
"0",
"Vuser 2:FINISHED SUCCESS",
"4",
"VU4 processed neworder 18178 payment 18294 delivery 1791 stocklevel 1861 orderstatus 1860 transactions",
"0",
"Vuser 4:FINISHED SUCCESS",
"5",
"VU5 processed neworder 18269 payment 18206 delivery 1824 stocklevel 1866 orderstatus 1819 transactions",
"0",
"Vuser 5:FINISHED SUCCESS",
"3",
"VU3 processed neworder 18789 payment 18668 delivery 1912 stocklevel 1878 orderstatus 1761 transactions",
"0",
"Vuser 3:FINISHED SUCCESS",
"1",
"CLIENT SIDE TPM : neworder 10511 payment 10470 delivery 1050 stocklevel 1065 orderstatus 1039",
"0",
"Vuser 1:FINISHED SUCCESS",
"0",
"ALL VIRTUAL USERS COMPLETE"

When testing clusters, this feature allows you to measure the throughput across multiple instances at the same time and therefore also to test the planned failure of some of the databases under test.

Virtual User Iterations

If you want to test killing connections and have HammerDB reconnect, then you should use the Virtual User Iterations feature. You can think of this value as an outer loop around Total Transactions per User set in the Driver Options for a workload. For example, if you set Total Transactions per User to 100,000 and Virtual User Iterations to 10, then each Virtual User will do 1,000,000 transactions in total across 10 separate sessions, having logged off and on 10 times. Equally, if you kill the connection first, the session will also reconnect, allowing you to use the same mechanism for failure scenarios.

When you create the virtual users, you can see the iterations value in the central panel.
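
In the CLI, the same scenario can be scripted; a minimal sketch for PostgreSQL, assuming pg_total_iterations is the driver option corresponding to Total Transactions per User:

dbset db pg
dbset bm TPC-C
# inner loop: Total Transactions per User
diset tpcc pg_total_iterations 100000
loadscript
vuset vu 4
# outer loop: each Virtual User logs off and on 10 times
vuset iterations 10
vucreate
vurun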

Now if you terminate a session, in this example with select pg_terminate_backend(9011); where 9011 is the pid of a Virtual User's backend, you can see the error and that the Virtual User has reconnected.

Also note that when killing connections, they can be left in an uncertain state that can result in crashes or errors when the same thread starts running again. Therefore, it is best practice to add a short pause before reconnecting, as shown with the line “after 2000” below, and also to preemptively close the connection before trying to reconnect, in this example with catch {pg_disconnect $lda}.
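
A minimal sketch of such an error handler in the style of the PostgreSQL driver script follows; the procedure name is illustrative and the connection arguments match those already used by the driver script:

# illustrative reconnect handler for a killed session
proc ReconnectAfterFailure { host port sslmode user password dbname } {
upvar 1 lda lda
# the killed session can leave the handle in an uncertain state,
# so preemptively close it before reconnecting
catch {pg_disconnect $lda}
# short pause to allow the server to clean up the old session
after 2000
set lda [pg_connect -conninfo [list host = $host port = $port sslmode = $sslmode user = $user password = $password dbname = $dbname]]
return $lda
}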

Summary

Benchmarking applications are not typically designed for failure and failover scenarios. Nevertheless, HammerDB provides users with the functionality to create their own bespoke test environments that allow them to test multiple instances in a cluster and also to reconnect and restart a test after a failure. Using these features either independently or combined enables you to build an application that tests database or connection failure rather than benchmarking performance.

Comparing HammerDB TPROC-C results with sysbench-tpcc

In a recent project comparing systems for MariaDB performance, a user had originally been using a tool called sysbench-tpcc to compare hardware platforms before migrating to HammerDB. However, the user was not aware that performance could be compared between the two. This is a brief post to highlight the metrics to use for the comparison, using a separate hardware platform for illustration purposes.

Firstly, it is worth noting that both HammerDB TPROC-C and sysbench-tpcc run workloads based on the TPC-C specification; however, as described here, the HammerDB workload is called TPROC-C to comply correctly with the TPC fair use rules.

Also note that HammerDB offers a feature to run a fixed-throughput workload with keying and thinking time that is closer to the specification; however, as only HammerDB offers both modes, in this case we will show the workloads run without keying and thinking time. HammerDB also runs natively on Windows and Linux with GUI, CLI and Web interfaces on multiple databases, but in this case the example will be MariaDB on Linux with the CLI.

Prepare or build the schema

Firstly, before running a workload, you need to build or prepare the schema. sysbench-tpcc offers the ability to build multiple schemas to work around scalability issues; however, the TPC-C specification uses a single set of tables, which can be built as follows.

./tpcc.lua --mysql-socket=/tmp/mariadb.sock --mysql-user=root --mysql-password=maria --mysql-db=tpccsb --time=300 --threads=64 --report-interval=1 --tables=1 --scale=400 --db-driver=mysql prepare

The equivalent in HammerDB is the buildschema command, with example settings as below. The scripts can be in Python or Tcl format.

#!/bin/tclsh
# maintainer: Pooja Jain

puts "SETTING CONFIGURATION"
dbset db maria
dbset bm TPC-C

diset connection maria_host localhost
diset connection maria_port 3306
diset connection maria_socket /tmp/mariadb.sock

set vu 100
set warehouse 400
diset tpcc maria_count_ware $warehouse
diset tpcc maria_num_vu $vu
diset tpcc maria_user root
diset tpcc maria_pass maria
diset tpcc maria_dbase tpcc
diset tpcc maria_storage_engine innodb
if { $warehouse >= 200 } {
diset tpcc maria_partition true
} else {
diset tpcc maria_partition false
}
puts "SCHEMA BUILD STARTED"
buildschema
puts "SCHEMA BUILD COMPLETED"

and run as follows from the command line.

./hammerdbcli auto ./scripts/tcl/maria/tprocc/maria_tprocc_build.tcl

You can see that both schemas are similar when built, with the main difference being that sysbench adds prefixes to the generated data in a number of columns to aid with compression, whereas HammerDB's data is closer to the specification.

MariaDB [tpccsb]> show tables;
+------------------+
| Tables_in_tpccsb |
+------------------+
| customer1        |
| district1        |
| history1         |
| item1            |
| new_orders1      |
| order_line1      |
| orders1          |
| stock1           |
| warehouse1       |
+------------------+
9 rows in set (0.000 sec)

MariaDB [tpccsb]> select * from warehouse1 limit 1;
+------+------------+--------------------+--------------------+-----------------+---------+-----------+-------+------------+
| w_id | w_name     | w_street_1         | w_street_2         | w_city          | w_state | w_zip     | w_tax | w_ytd      |
+------+------------+--------------------+--------------------+-----------------+---------+-----------+-------+------------+
|    1 | name-ussgn | street1-suwfdxnitk | street2-sdptwkrcjd | city-wowgpzhpmq | fu      | zip-12460 |  0.12 | 8398416.00 |
+------+------------+--------------------+--------------------+-----------------+---------+-----------+-------+------------+

MariaDB [tpcc]> show tables;
+----------------+
| Tables_in_tpcc |
+----------------+
| customer       |
| district       |
| history        |
| item           |
| new_order      |
| order_line     |
| orders         |
| stock          |
| warehouse      |
+----------------+
9 rows in set (0.000 sec)

MariaDB [tpcc]> select * from warehouse limit 1;
+------+-----------+--------+----------+---------------+---------------------+--------------------+---------+-----------+
| w_id | w_ytd     | w_tax  | w_name   | w_street_1    | w_street_2          | w_city             | w_state | w_zip     |
+------+-----------+--------+----------+---------------+---------------------+--------------------+---------+-----------+
|    1 | 300000.00 | 0.1800 | kyKhVJqn | ukYR4HaaEJLVi | icFhnjwgqE3cexTJFwR | Kxf1T7pcaHNyvELEIx | lH      | 358511111 |
+------+-----------+--------+----------+---------------+---------------------+--------------------+---------+-----------+

Running the workloads

We can now run the workloads, in both cases using 80 threads, or Virtual Users in HammerDB terminology. The sysbench-tpcc workload is run as follows.

./tpcc.lua --mysql-socket=/tmp/mariadb.sock --mysql-user=root --mysql-password=maria --mysql-db=tpccsb --time=300 --threads=80 --report-interval=1 --tables=1 --scale=400 --db-driver=mysql run

As the workloads are based on the same specification, you can use HammerDB to monitor the sysbench-tpcc workload, with the metstart command starting the CPU monitor and the tcstart command starting the transaction counter.

hammerdb>metstart
Starting Local Metrics Agent on ubuntu
after#1
hammerdb>Connecting to Agent to Display CPU Metrics
Metric receive port open @ 27702 on ubuntu
Connecting to HammerDB Agent @ localhost:10000
Testing Agent Connectivity...OK
Metrics Connected
Started CPU Metrics for Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz:(112 CPUs)
hammerdb>tcstart
Transaction Counter Started

hammerdb>0 MariaDB tpm
CPU all usr%-0.00 sys%-0.02 irq%-0.00 idle%-99.97
0 MariaDB tpm

and a HammerDB example script as follows:

#!/bin/tclsh
# maintainer: Pooja Jain

set tmpdir $::env(TMP)
puts "SETTING CONFIGURATION"
jobs profileid 0
dbset db maria
dbset bm TPC-C
giset timeprofile xt_gather_timeout 1200
giset commandline keepalive_margin 1200

diset connection maria_host 127.0.0.1
diset connection maria_port 3306
diset connection maria_socket /tmp/mariadb.sock
#
diset tpcc maria_user root
diset tpcc maria_pass maria
diset tpcc maria_dbase tpcc
diset tpcc maria_driver timed
diset tpcc maria_rampup 2
diset tpcc maria_duration 5
diset tpcc maria_no_stored_procs false
diset tpcc maria_allwarehouse false
diset tpcc maria_timeprofile true
diset tpcc maria_purge true
#start CPU
metstart
puts "TEST STARTED"
loadscript
vuset vu 80
vuset logtotemp 1
vucreate
tcstart
tcstatus
vurun
tcstop
vudestroy
#stop CPU
metstop
puts "TEST COMPLETE"

Note that a key parameter here is setting maria_no_stored_procs to true or false. HammerDB uses stored procedures for higher throughput, as fully explained here, but also offers a client-side SQL version for comparison. sysbench implements a client-side SQL version, so this is key to understanding the difference between the two workloads. The HammerDB workload is run as shown:

./hammerdbcli auto ./scripts/tcl/maria/tprocc/maria_tprocc_run.tcl
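
To compare like for like with sysbench-tpcc's client-side SQL implementation, only one setting in the script above needs to change:

diset tpcc maria_no_stored_procs true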

Comparing the results

When running the 80-thread sysbench-tpcc workload and monitoring with HammerDB, we can see the following output.

hammerdb>0 MariaDB tpm
CPU all usr%-17.58 sys%-4.18 irq%-0.00 idle%-78.21
485784 MariaDB tpm
CPU all usr%-17.87 sys%-4.30 irq%-0.00 idle%-77.80
497376 MariaDB tpm
CPU all usr%-17.79 sys%-4.36 irq%-0.00 idle%-77.84
478644 MariaDB tpm
CPU all usr%-17.74 sys%-4.35 irq%-0.00 idle%-77.88
485718 MariaDB tpm

and when it has finished, it prints output as follows:

SQL statistics:
    queries performed:
        read:                            31779084
        write:                           32983004
        other:                           4898362
        total:                           69660450
    transactions:                        2449061 (8162.57 per sec.)
    queries:                             69660450 (232173.91 per sec.)
    ignored errors:                      10685  (35.61 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      8162.5668
    time elapsed:                        300.0356s
    total number of events:              2449061

Latency (ms):
         min:                                    0.35
         avg:                                    9.80
         max:                                  307.66
         95th percentile:                       41.10
         sum:                             23997083.58

Threads fairness:
    events (avg/stddev):           30613.2625/297.35
    execution time (avg/stddev):   299.9635/0.01

The key figure here is 8162.57 transactions per second; multiplied by 60, this gives us 489,754 TPM (transactions per minute), which is the figure we can use for comparison, as can be seen from the HammerDB transaction counter output reporting the same data.
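
The conversion is simple per-second to per-minute arithmetic, for example in tclsh:

# convert the sysbench transactions per second figure to TPM
puts [expr {round(8162.57 * 60)}]
# output: 489754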

Running HammerDB with stored procedures we can see the difference in CPU utilisation and transactions.

CPU all usr%-52.76 sys%-6.52 irq%-0.00 idle%-40.52
1512462 MariaDB tpm
CPU all usr%-52.85 sys%-6.43 irq%-0.00 idle%-40.55
1519824 MariaDB tpm
CPU all usr%-53.01 sys%-6.47 irq%-0.00 idle%-40.35
1515888 MariaDB tpm
CPU all usr%-52.99 sys%-6.36 irq%-0.00 idle%-40.51
1524312 MariaDB tpm

and it is the TPM value that we use for comparison and not NOPM, as both tools are measuring transactions per second/minute.

Vuser 1:80 Active Virtual Users configured
Vuser 1:TEST RESULT : System achieved 632885 NOPM from 1471425 MariaDB TPM

If we use the HammerDB no stored procedures option, we can see that performance drops, as would be expected.

Vuser 1:Test complete, Taking end Transaction Count.
Vuser 1:80 Active Virtual Users configured
Vuser 1:TEST RESULT : System achieved 491964 NOPM from 1143635 MariaDB TPM

and referring to the previous article, we can see that we are using more system time, as we are now spending more time in the network.

1155714 MariaDB tpm
CPU all usr%-52.57 sys%-11.44 irq%-0.00 idle%-35.90
1163382 MariaDB tpm
CPU all usr%-52.73 sys%-11.57 irq%-0.00 idle%-35.61
1154976 MariaDB tpm
CPU all usr%-52.60 sys%-11.71 irq%-0.00 idle%-35.60
1153836 MariaDB tpm
CPU all usr%-52.56 sys%-11.67 irq%-0.00 idle%-35.68

Analyzing results

HammerDB will also automatically generate graphs for you to analyze your workload, including detailed response times per transaction.
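
The data behind these graphs is recorded as jobs (note the jobs profileid line in the example script); a brief sketch of retrieving it at the CLI, assuming the jobs command of recent HammerDB releases, where JOBID is a placeholder for an id returned by the jobs command:

# list the ids of recorded jobs
jobs
# report the NOPM/TPM result for a job
jobs JOBID result
# report the captured transaction response times
jobs JOBID timing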

Summary

Of course, the more benchmarks and workloads you run against a system, the more insights you can gain. All benchmarks are valuable; however, it is important to ensure that you are deriving accurate results.

In our example, a user was initially using sysbench-tpcc to compare different hardware systems for MariaDB, but was drawing conclusions about both the hardware and database software capabilities that were not in keeping with our observations. Using this approach, we provided an alternative measurement and showed how the two approaches compared, to illustrate the capabilities of both the hardware and the software.