Measuring EFT Performance with Perfmon


THE INFORMATION IN THIS ARTICLE APPLIES TO:

  • EFT v7.4.11 and later

Overview

This document outlines the procedures necessary to capture and analyze key performance metrics using Windows’ built-in performance measurement tool, Perfmon, to help EFT administrators:

  • Troubleshoot problematic behavior by highlighting performance bottlenecks
  • Baseline performance when operating with a given configuration or version, then benchmark and compare performance when introducing changes to configuration or versions
  • Perform capacity planning by evaluating performance trends over time

To measure performance, Windows exposes a large number of performance counters, which Perfmon can sample at specific intervals, displaying and optionally recording the results over time. In addition to Windows’ default counters, EFT exposes a large number of counters of its own, providing valuable insight into EFT’s internal state. By juxtaposing Windows and EFT counters, a qualified individual can assess how the system's resources are being affected by applications. For example, if adding 1 connected user increases the application's Private Bytes by 1 MB, and adding another 10 users increases Private Bytes by another 10 MB, we can extrapolate that Private Bytes will grow in proportion to the number of connected users, at a rate of approximately 1 MB per user.
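The extrapolation in the example above amounts to fitting a straight line through (connected users, Private Bytes) samples. The following is a minimal sketch in Python, assuming a handful of hand-collected samples; the sample values and the function name `fit_line` are illustrative and do not come from EFT.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical samples: connected users and Private Bytes (in MB).
users = [0, 1, 11]
private_mb = [200.0, 201.0, 211.0]

slope, intercept = fit_line(users, private_mb)
projected = slope * 500 + intercept  # projection for 500 connected users

print(f"about {slope:.2f} MB per connected user")
print(f"projected Private Bytes at 500 users: {projected:.0f} MB")
```

A straight-line fit is only as good as the range it was measured over, so a projection far beyond the sampled user counts should be treated as a rough estimate.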

For more information about EFT performance counters, search for "Performance Counters" in your version of the EFT online help.

Counter Creation

To capture counters, you can either run Perfmon and view counter results as they are captured in real time, or create a Data Collector Set (DCS), which tracks a set of desired counters over time so that you can view the results of a lengthy capture after the fact.

Real-Time Counters

Real-Time Capture*

  1. Open Perfmon from the Start menu
  2. Navigate to Monitoring Tools > Performance Monitor
  3. Add the desired counters (see below)
  4. View results in real-time

*The downside of real-time capturing in the Perfmon tool is that you cannot save or load a set of counters. The next two procedures show how to do so by starting Perfmon from the command line.

Create then Export a Perfmon Config

  1. Open a command prompt
  2. Run “perfmon /sys”
  3. Add the desired counters (see table below)
  4. From the File menu, select “Save” and export the configuration to a file

Load a Perfmon Config into a Running Counter

  1. Open a command prompt
  2. Run “perfmon /sys”
  3. From the File menu, select “Load” then import the configuration exported previously

Long-Term Counters (Data Collectors)

Viewing counter measurements in real time is useful when troubleshooting or evaluating performance within a narrow time window, such as when diagnosing slow performance. For capacity planning or for troubleshooting rare events (such as sporadic non-responsiveness), it is preferable to run a set of counters over a longer period, perhaps sampling at longer intervals, and then analyze the resulting measurements after a period of hours or even days. For capacity planning, it may be advisable to run a set of counters for a day, repeat the capture at regular intervals (such as weekly), and then analyze the differences over a longer period (such as a month) to gain a big-picture view.

Creating a Data Collector Set (DCS)

  1. Open Perfmon from the Start menu
  2. Navigate to Data Collector Sets > User Defined, then right-click and select New > Data Collector Set
  3. Choose to create one from a template (see Importing a DCS below) or create one manually
  4. If creating manually, add all the desired counters

Exporting (Saving) a Data Collector Set (DCS)

  1. Open Perfmon from the Start menu
  2. Navigate to Data Collector Sets > User Defined
  3. Select a previously created DCS
  4. Right Click and select “Save Template”
  5. Save the file (XML format)

Importing a DCS

  1. Follow the steps under Creating a DCS
  2. At step 3, choose to create from a template and select the XML file you saved at step 5 under Exporting (Saving) a Data Collector Set (DCS)
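Once a Data Collector Set has run, its log can also be analyzed outside of Perfmon. The sketch below assumes the log has been saved in, or converted to, comma-separated format, with the timestamp in the first column and one counter path per remaining column header. The host name `EFT01`, the sample rows, and the function name are hypothetical.

```python
import csv
import io


def summarize_counters(csv_text):
    """Return {counter path: (average, maximum)} for a CSV counter log."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    samples = {name: [] for name in header[1:]}  # first column is the timestamp
    for row in reader:
        for name, cell in zip(header[1:], row[1:]):
            try:
                samples[name].append(float(cell))
            except ValueError:
                continue  # blank or non-numeric cell: skip this sample
    return {
        name: (sum(values) / len(values), max(values))
        for name, values in samples.items()
        if values
    }


# Hypothetical three-sample log; a real capture has many more rows and columns.
SAMPLE_LOG = r'''"Time","\\EFT01\Processor(_Total)\% Processor Time","\\EFT01\Memory\Available MBytes"
"01/15/2024 10:00:00.000","40.0","4096"
"01/15/2024 10:00:15.000"," ","4000"
"01/15/2024 10:00:30.000","60.0","3904"
'''

for counter, (average, maximum) in summarize_counters(SAMPLE_LOG).items():
    print(f"{counter}: avg={average:.1f} max={maximum:.1f}")
```

Running the same summary against captures taken a week apart gives a simple way to compare averages and peaks over time, as suggested for capacity planning.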

Performance Counter Alerts

You can also create long-running Data Collectors that alert you when a counter rises above (or falls below) a threshold that you define. These can be extremely valuable for detecting and preventing problems before they occur. For example, if you know that your maximum ARM Queue Size is set to 100,000, then you could set the counter alert at 90,000 so that you are alerted with ample time to react (for instance, by checking on the health of the SQL or EFT system). The thresholds are not all that sophisticated (for example, they cannot detect whether the criteria are matched over X number of samples), but they are good enough for measurements that, when hit even once, could mean problems such as outages.
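Where a single-sample alert is too coarse, a check for a sustained breach can be run separately against logged values. The following is a minimal sketch, assuming the ARM Queue Size samples have already been read into a list; the sample values and the function name are hypothetical.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """True if samples stay above threshold for min_consecutive samples in a row."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0  # reset on any sample at or below threshold
        if run >= min_consecutive:
            return True
    return False


# Hypothetical ARM Queue Size samples taken at a fixed interval.
queue_sizes = [12_000, 95_000, 40_000, 91_000, 92_500, 93_000, 30_000]

print(sustained_breach(queue_sizes, 90_000, 1))  # any single sample over the threshold
print(sustained_breach(queue_sizes, 90_000, 3))  # three samples in a row
print(sustained_breach(queue_sizes, 90_000, 4))  # four samples in a row
```

Requiring several consecutive samples filters out the occasional spike while still catching a queue that stays high.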

To Create a Data Collector Set (DCS) with Performance Counter Monitors

  1. Open Perfmon from the Start menu
  2. Navigate to Data Collector Sets > User Defined, then right-click and select New > Data Collector Set
  3. Choose to create the set manually (Advanced), then click Next
  4. Select the Performance Counter Alert option, then click Next
  5. Add a counter and set a threshold, repeat as necessary, and click Finish when done

Important System Counters

For the purpose of evaluating EFT’s performance, you will need to monitor both system and application (EFT) counters. The following table outlines a set of critical counters related to CPU, disk, memory, and network resources, while also calling out specific counters that EFT publishes. Keep in mind that this is a small subset of the counters available, so feel free to add others that you think are important. The Expected Value / Notes column outlines the threshold values that, if exceeded, could indicate a problem with that particular resource. In the next section we will provide more detail on how to read and analyze data collector results.

There are plenty of resources online that provide in-depth analysis on how to read and understand various performance counters. Below are just a few resources you can find with a simple Google search:

Specific disk counters: https://blogs.technet.microsoft.com/askcore/2012/02/07/measuring-disk-latency-with-windows-performance-monitor-perfmon/

More disk counter info: https://blogs.technet.microsoft.com/askcore/2012/03/16/windows-performance-monitor-disk-counters-explained/#comments

Network counters: https://docs.microsoft.com/en-us/windows-server/networking/technologies/network-subsystem/net-sub-performance-counters

Advice on measuring the performance of SQL Server by monitoring SQL Server objects and counters: https://docs.microsoft.com/en-us/sql/relational-databases/performance-monitor/sql-server-xtp-in-memory-oltp-performance-counters?view=sql-server-2017

CPU counters: https://docs.microsoft.com/en-us/sql/relational-databases/performance-monitor/monitor-cpu-usage?view=sql-server-2017

Counter | Information Provided | Expected Value / Notes

Processor (CPU)

Processor\% Processor Time

The percentage of time that the processor spends active, that is, the percentage of processing capacity being used. Note that this is the same counter as Processor Information > Processor Time.

Less than 85% on average. Note that this is a general measurement of how busy the system is, and it is normal for the CPU to remain busy while under load; however, if it is pegged at almost 100% utilization while all other metrics are low, then you might be CPU bound and should consider investing in a more performant system.

Processor\% User Time

This counter reflects the work the CPU does on behalf of applications, such as looping through an array or running functions within the application itself. It excludes work that involves the system, such as writing a file to disk, which falls under privileged time.

Less than 85% on average. User Time and Privileged Time should be looked at as a unit. If Privileged Time is consistently higher than User Time and the application is performing poorly, then it is possible that the CPU is tied up handling privileged requests that may or may not be related to the specific application being monitored.

Processor\% Privileged Time

This counter measures the % of CPU utilization dedicated to handling system-oriented tasks that are of higher “privilege” than user (or application) oriented tasks. Generally, the combination of privileged and user time will equal the total processor time.

Less than 85% on average. User Time and Privileged Time should be looked at as a unit. If Privileged Time is consistently higher than User Time and the application is performing poorly, then it is possible that the CPU is tied up handling privileged requests that may or may not be related to the specific application being monitored.

Process (the app)

Process (csftpstes.exe)\% Privileged Time, % Processor Time, % User Time

These are the same as the three measurements above, except that they isolate CPU utilization to what is strictly associated with the EFT server service executable. As such, they will be a subset of the overall processor usage.

Less than 85% on average. Keep in mind that these are a subset of the three measurements taken for the entire system. They are helpful when you want to determine whether EFT or some other application, such as an AV tool running in the background, is consuming the majority of resources.

Process (csftpstes.exe)\Handle Count, Thread Count

These two values are distinct but related. A thread is a separate, sequential set of instructions executed by the CPU on behalf of the application. A handle is a logical association with a resource, such as a file, memory location, or dialog. A thread is typically used to open or obtain a handle to said resource.

Steady values. Thread counts increasing with utilization is normal, as is an increase in handles; however, if handles or threads are increasing in an unbounded fashion over time, then EFT could be experiencing a memory leak. Note that a large number of threads or handles (even in the tens of thousands) is OK. It is a constant increase with no decrease over time, even when server utilization fluctuates or drops, that should raise a red flag.

Process (csftpstes.exe)\Private Bytes

This is generally (with many exceptions) a value that can be associated with how much memory an application is consuming.

Less than 2 GB. Note that there are many factors in determining memory consumption and memory leaks. An increase in memory as utilization increases is to be expected; however, an unbounded increase, or memory utilization associated with csftpstes.exe exceeding 2 GB, should be looked at.
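One rough way to separate load-driven growth from the unbounded growth described for Handle Count, Thread Count, and Private Bytes is to check whether the counter ever gives back what it gained. The sketch below is a heuristic, not a leak detector: it assumes samples taken at regular intervals and flags a series whose low-water mark rises in every successive window. The sample values, window size, and function names are hypothetical.

```python
def window_minimums(samples, window):
    """Return the minimum of each full, non-overlapping window of samples."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        min(samples[start:start + window])
        for start in range(0, len(samples) - window + 1, window)
    ]


def looks_unbounded(samples, window):
    """Heuristic: True when the low-water mark rises in every successive window.

    Peaks are ignored on purpose, since they follow server load. A leak shows
    up as a floor that never comes back down once load drops.
    """
    lows = window_minimums(samples, window)
    return len(lows) >= 3 and all(
        later > earlier for earlier, later in zip(lows, lows[1:])
    )


# Hypothetical Handle Count samples, four samples per load cycle.
healthy = [5000, 9000, 12000, 6000, 5200, 11000, 13000, 5100, 5050, 9500, 12500, 5300]
leaking = [5000, 9000, 12000, 7000, 8000, 12000, 15000, 9500, 11000, 15000, 18000, 12500]

print(looks_unbounded(healthy, window=4))
print(looks_unbounded(leaking, window=4))
```

The window should span at least one full load cycle (for example, a business day), otherwise ordinary ramp-up during a busy period can be mistaken for a leak.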

System

System\Processor Queue Length

Shows the number of threads waiting to be serviced by the processor. Waiting threads translate directly into slower performance.

No greater than 5 times the number of logical processors, on average. Take the number shown and divide by the number of logical processors. If the result is greater than 5, then more processing power might be needed. Search the web for “Processor Queue Length” for in-depth analysis of this metric.
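The Processor Queue Length rule of thumb can be written out as a small calculation. This is a minimal sketch assuming the average queue length has been read from a log and the number of logical processors is known; the function names and sample figures are hypothetical.

```python
def queue_per_processor(avg_queue_length, logical_processors):
    """Average number of waiting threads per logical processor."""
    if logical_processors < 1:
        raise ValueError("logical_processors must be at least 1")
    return avg_queue_length / logical_processors


def needs_more_cpu(avg_queue_length, logical_processors, limit=5):
    """Apply the guideline: more than `limit` waiting threads per processor."""
    return queue_per_processor(avg_queue_length, logical_processors) > limit


# Hypothetical server with 8 logical processors.
print(queue_per_processor(24, 8), needs_more_cpu(24, 8))
print(queue_per_processor(48, 8), needs_more_cpu(48, 8))
```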

Disk

Physical Disk\% Idle Time

The amount of time your disks are idle, that is, not performing any action. You can also use % Disk Write Time and % Disk Read Time, or just % Disk Time, to assess the opposite of idle time. Generally, you don’t need all four. IMPORTANT: While _Total is a valid instance, you should select the actual physical disk that is being utilized, e.g. “c:\”.

Greater than 85%, on average. If % Idle Time falls below 20% and stays there, then the disk is in constant read or write mode. Couple this measurement with others, such as disk queue length and reads/writes per second (measured against the disk’s operational specs), to determine whether the disk is a bottleneck.

Physical Disk\Disk Reads/sec and Disk Writes/sec

Overall rate of read and/or write operations on the disk. (Can be used to determine IOPS to evaluate hardware needs and as a benchmark for hardware upgrades.)

Less than 80% of what the disk is rated for. These counters report operations per second rather than a percentage, so compare them against the disk’s operational specs. The rate typically moves in the opposite direction of % Idle Time. Keep in mind that I/O will be high during high load situations.

Physical Disk\Current Disk Queue Length and Avg. Disk Queue Length

Current Disk Queue Length is a snapshot of the read or write requests queued at the time a measurement is taken. The result can be a bit misleading, which is why you also want to look at Avg. Disk Queue Length, which derives an average of values between measurement intervals.

Calculating a disk bottleneck from these numbers is difficult. If back-to-back measurements of Current Disk Queue Length are the same, then Avg. Disk Queue Length can be used to measure outstanding I/O requests (otherwise it cannot). It is best to have someone with expertise evaluate these results.

Physical Disk\Avg. Disk sec/Read and Avg. Disk sec/Write

This is a measurement of the average time, in seconds, that it takes to read from (or write to) disk. Note that the latency measured is the time from when the partition manager receives the I/O request to the time it completes.

Less than 20 (milliseconds). This value is calculated with millisecond precision (the default multiplier is 1000), so a value of “5” shown in the log is 0.005 of a second. If the value increases under load to the point where tens of milliseconds of latency are detected, on average, it could signify slowness beneath the partition manager (in the class driver, port driver, device miniport driver, or disk subsystem).

Physical Disk\Disk Bytes/sec

Measures disk I/O throughput, both read and write.

Less than the system specs for that disk’s max throughput. There is no specific number to look for; instead, compare the average bytes per second of I/O against what the disk subsystem is actually capable of.

Others:

Split IO/Sec can be useful for detecting a heavily fragmented disk.

% Free Space is useful in case you didn’t realize you were running out of space (especially when measured over time).

Memory

Memory\Available MBytes

The amount of free memory.

Less than 80% utilization. If utilization is higher and sustained, then look into increasing the system’s memory.

Memory\% Committed Bytes In Use

This is the ratio of Committed Bytes to the Commit Limit.

Less than 80% utilization. If utilization is higher and sustained, then look into increasing the system’s memory.

Network

Network Interface\Bytes Total/Sec

This counter simply measures the overall (inbound and outbound) bytes transferred over the wire at the moment in time the snapshot was taken. When adding this counter, be sure to specify the correct network interface, or just specify all if you aren’t sure which one is being utilized.

Less than 70% utilization, on average. To determine utilization, you must first determine what your available bandwidth is and what your NIC is capable of. Also, the total bytes should be multiplied by 8 to get bits per second, as most throughput measurements are in bps, not Bps. To determine utilization, use this formula: Utilization = ((Bytes Total/sec * 8) / current bandwidth in bps) * 100. During high loads this number may reach saturation thresholds if all other resources are not maxed out. If it does, then bandwidth could be your bottleneck.
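The network utilization formula can be checked with a short calculation. This is a minimal sketch assuming a Bytes Total/sec sample and a link speed in bits per second (for example, the value reported by the Network Interface\Current Bandwidth counter); the function name and sample figures are hypothetical.

```python
def network_utilization_percent(bytes_total_per_sec, bandwidth_bps):
    """Utilization = ((Bytes Total/sec * 8) / bandwidth in bps) * 100."""
    if bandwidth_bps <= 0:
        raise ValueError("bandwidth_bps must be positive")
    bits_per_sec = bytes_total_per_sec * 8  # counter reports bytes, links are rated in bits
    return bits_per_sec / bandwidth_bps * 100


# Hypothetical sample: 75 MB/s observed on a 1 Gbps link.
utilization = network_utilization_percent(75_000_000, 1_000_000_000)
print(f"{utilization:.1f}% of link capacity")
print("over the 70% guideline" if utilization > 70 else "within the 70% guideline")
```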

EFT Server Counters

ARM Queue Size

Measures the database inserts currently queued up waiting for SQL (or Oracle).

Less than 10,000 on average on a high load server. An occasional spike in queue size is not necessarily a problem; however, sustained high numbers in the hundreds of thousands, or a growing queue size, could indicate that the database server does not have the resources to handle the volume of traffic EFT is throwing its way. Note: If the number is pegged at 1,000, then you may need to apply the advanced property in EFT to override the default max allowed queue size (1,000). Change that number to 500,000 or similar to get a better reading from Perfmon.

Connected Admin Count

Shows the count of currently connected admins.

Less than 10 per server node. A large number of concurrently connected admins could result in performance slowdowns as EFT fights to keep configuration changes from stepping all over each other. Ideally, you would have no more than a half-dozen privileged admins, or a larger set of admins who are allocated specific (lesser) admin roles, to avoid conflict.

Workspaces Licenses Used

Measures the number of Workspaces currently allocated and not expired. This can be useful for determining whether Workspaces are growing at an unbounded rate due to heavy use by users.

Less than 100,000 per server node. Once this number grows into the tens or hundreds of thousands, EFT can get bogged down as it attempts to manage these resources, such as routinely checking which ones are expired.

EFT Site Counters

All

Each counter measures something that can be useful depending on the troubleshooting situation.

No expected values to measure; however, keep an eye on AWE actions queue size, as a growing queue could indicate that your max allowed AWE objects and threads (a set of advanced properties) are set too low, resulting in backed-up AWE workflows that could slow down EFT if that queue grows too large.

SQL Counters

Various

Search the web for which counters to measure. Links are provided above.

If troubleshooting your SQL server (for example, you are trying to determine why EFT’s ARM queue size is growing too large), then there are a number of counters you can run that are specific to the SQL application. Those fall outside the scope of this doc.