Load Testing Reveals Cause of SharePoint Server Performance Problem

A New System’s Poor Performance

Although it seems obvious that adding hardware resources to a system should provide improved performance, a customized Microsoft Office SharePoint® Server (MOSS) website cluster showed the opposite in recent testing for a customer. Load testing with 200 simulated users gave disappointing results, with page durations between 10 and 30 seconds, and the system handling only about four pages per second. Curiously, reducing the cluster to a single SharePoint® server improved the performance.

Suspicions of Trouble

Our customer, the Society for Human Resources Management, was worried. Their new web server system, intended as the primary interface to their 250,000 members, appeared to be too slow. In only four months it would be in production, and it was intended to provide improved work-flow and publishing features, as well as an enhanced customer experience.

Our initial load testing showed that their concerns were justified. The new site could not handle 200 simulated users, let alone the anticipated load of 1500 simultaneous users.

There should not have been any problem. The hardware and software provided plenty of capacity, with a cluster of four servers, each an HP VL360 G5 with two quad-core processors. Three of them were running SharePoint® Server with 16 GB of RAM, and the fourth ran the database server with 32 GB. The servers sat behind a Cisco CSS load balancer on a 45 Mbps DS3 line. The database storage was an EMC Clarion CX-500 SAN and the web servers used only local disk storage. All servers were running Windows Server 2003 64-bit Enterprise SP2. The web servers ran Microsoft Office SharePoint® 2007 64-bit, and the SQL server ran SQL Server 2005, roll-up 8.

Candidate Causes

What could the problem be? The possibilities included the hardware (CPU, network, memory or disk), the software, and the software configuration. In particular, we were aware that we should suspect connection pools, thread pools, resource contention and database locking.

Designing the Tests

With the customer’s help, we proposed a handful of test cases to exercise about 500 pages from their site. This would give our load testing software, Web Performance Load Tester®, a repeatable interface to a relatively small subset of their entire site, a content-rich site with over 15,000 articles. A key consideration was to get rapid results on a very short deadline. We selected five test cases that exercised various navigation paths through the site.

Initial Tests

We were now ready to execute the first tests on the new site. Initial tests were not promising. Under a simulated load of 100 simultaneous users, the system returned pages, on average, in less than 3 seconds – but that was only after the first group of users had passed the homepage and login steps.

Page duration greater than 20 seconds

As we ramped up, the average page durations (APDs) peaked at over 12 seconds. After the second group of users was added, for a total of 200, average page durations exceeded 20 seconds, as shown in this chart:

Figure 1: Initial testing shows poor performance – 20-30 second page durations at 200 users

During the test, Load Tester’s Server Monitoring Agents gathered metrics that indicated hardware was not the bottleneck. CPU, memory and disk were not taxed during the tests. Subsequent tests and investigations indicated that the network and load balancer were not the limiting factors either.

Testing the cluster’s individual SharePoint® servers

The next step was to isolate each SharePoint® web server in the cluster and test them individually. These tests revealed a number of differences between the servers. For instance, one server was not compressing the page content. More importantly, we found that running the site with only a single SharePoint® web server resulted in better performance! A single server gave average page durations under 6 seconds with up to 300 users. This was three times the capacity of the system running three web servers. (As you view this chart, note that the test ran for a shorter period than the previous one, with a resulting change in scale on the Users axis and the Time axis.)

Figure 2: A single server performed better, but performance is still not acceptable

CPU usage not scaling with applied load

We also noted that CPU utilization was not scaling linearly with the applied user load. At about 400 users, the CPU utilization peaked on the web and database servers around 60% and 30% respectively.

Figure 3: CPU utilization levels off after 400 users

Hardware not the problem

Additional user load did not raise these CPU levels. Indeed, CPU usage declined as more load was added. After the peak, additional load did not raise the key throughput metrics, such as hits/sec, pages/sec and bytes/sec. The server metrics did not indicate a bottleneck in any other hardware category (network, memory or disk), leaving software or software configuration as the most likely limiting factor. The most common culprits in this situation are connection pools, thread pools, resource contention and database locking. However, there was no indication in the test data that the pools or other resources were misconfigured. Several DBAs had monitored the database server during the tests and none saw evidence of locking behavior. It was time to delve deeper into SharePoint-specific areas of concern.
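The plateau described above (more load, no more throughput) can be located mechanically from ramp data. The following is a minimal sketch, not part of the original test tooling: the function name, the 5% gain threshold and the sample numbers are all hypothetical and are chosen only to illustrate the shape of the curve.

```python
from typing import List, Optional, Tuple


def find_throughput_plateau(
    samples: List[Tuple[int, float]],
    min_gain: float = 0.05,
) -> Optional[int]:
    """Return the user level after which added load stops raising throughput.

    samples: (concurrent_users, pages_per_second) pairs, sorted by user count.
    min_gain: smallest relative throughput gain that still counts as scaling
              (an assumed threshold, not a value from the article).
    """
    for (users_a, tput_a), (_users_b, tput_b) in zip(samples, samples[1:]):
        if tput_a <= 0:
            continue  # cannot compute a relative gain from a zero baseline
        gain = (tput_b - tput_a) / tput_a
        if gain < min_gain:
            return users_a
    return None  # throughput kept scaling across the whole ramp


# Hypothetical ramp data, NOT measurements from the tests described here.
ramp = [(100, 4.0), (200, 8.1), (300, 12.0), (400, 15.8), (500, 15.9), (600, 15.2)]
print(find_throughput_plateau(ramp))  # 400
```

When a plateau like this appears while CPU, memory, disk and network all have headroom, the limit is more likely to be a software resource than hardware, which is the reasoning followed in this investigation.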

SharePoint® Tuning

During the next series of tests, we focused on testing a single server, since there was little point in load testing and tuning a cluster of servers with the individual servers not operating up to their potential.

A number of optimizations to the SharePoint® configuration were suggested and implemented.

* We moved static resources (images, etc.) to an image library to facilitate caching of the resources in the browser;
* We changed SharePoint® cache settings to Extranet Publishing Site;
* We changed the custom role provider to use Role Provider Caching;
* We also changed the Content Query Web Part to handle taxonomy more efficiently.

After each change was implemented, we measured the change in performance. In each test, an improvement in bandwidth utilization was observed, particularly between the SharePoint® servers and the database; however, the end-user performance was unchanged.

Testing against an out-of-the-box installation

Next we tried to determine whether the entire SharePoint® installation would share this performance profile, or if it applied only to the instance that was being tested. The customer created a new out-of-the-box SharePoint® site using one of the example sites. We tested this site to 1500 users, and observed only slight degradation at the peak. The test was very near or past the bandwidth limits of the network connection, which was a 45 Mbps DS3 line.

Figure 4: Average page durations are greatly improved – under 5 seconds up to 1500 users

Investigating Authentication

Now convinced that the OS, hardware and SharePoint® installation were healthy, we returned to the original site and targeted authentication. A new test case was designed that visited six public pages as an unauthenticated user. The system was tested and scaled to 1000 users, but performance was poor. Average page durations were in the 10-second range. The system was stable, but performance degraded rapidly by 1200 users, as we again hit the bandwidth limits.

Curious to see whether the improved results of the previous test were due to a lower number of unique pages visited, rather than to authentication, we next designed a test case that visited a larger number of pages, both authenticated and not. This test included more pages than the first unauthenticated test, but far fewer than the original test scenario. This load test produced better performance, but was unstable, exhibiting a stalling behavior when under load. For example, the system ramped up to 1300 users serving about 30 pages per second, but as the test added further load, throughput suddenly dropped to fewer than 5 pages per second. We observed the same stalling behavior in multiple test runs at varying load levels.

Figure 5: System throughput scaled with load, then dropped to very low levels

Adding one test case causes instability

We next dissected the same test case into several iterations, to determine if any particular group of pages performed better or worse than others, but found no offenders. We then returned to a set of pages that did not require authentication, this time picking a larger set of pages containing a variety of features. There were 27 pages total. Load tests revealed the system could service these pages with average page durations under one second at 1500 concurrent users with consistent throughput of about 39 pages per second for two hours. Further experimentation revealed that the addition of one relatively simple test case caused the system to become unstable. Now we had an easy way to demonstrate how different usage patterns could yield good and bad performance of the system under the same configuration. We hoped this result would allow Microsoft SharePoint Support Engineers to offer some SharePoint-specific tuning advice.

Rebooting the database server improves performance

During some of the previous tests, we also noticed that system performance sometimes degraded consistently from one test to the next. We subsequently discovered that rebooting the database server between test runs temporarily improved performance. To help get consistency from the test results we began regularly rebooting all the servers prior to each test. This is actually a good test practice to ensure a consistent testing environment. Although we did not realize it at the time, the symptom of improved performance after rebooting was important, and later proved to be key to understanding the fundamental problems with the system.

Reducing the number of processors improves performance

After looking at our test results as well as collecting their own data, Microsoft SharePoint® Support indicated that SharePoint® was apparently unable to make use of such large hardware (8 processors with 16 GB of RAM). In an effort to validate that the problem was indeed caused by the large hardware, they recommended that we reduce the number of processors to 4, and then later suggested reducing it to 2. In each case, this resulted in a surprising performance improvement, but the stalling behavior remained. Reducing the number of processors moved the point of failure, allowing the system to run longer before stalling, but did not cure the problem. We now had proof that the problem was unrelated to the size of the hardware and that it warranted more detailed, low-level analysis.

Database Tuning

Early in the testing we had suspected that the database was the bottleneck. However, an analysis of database performance during the tests by both the customer’s in-house DBAs and Microsoft DBAs determined that locking contention was at low levels and the database was performing well. This had put the focus on the SharePoint® servers. It now seemed prudent to return our attention to the database.

Contention in allocation

After additional testing and data gathering, Microsoft Support engineers found that contention on tempdb allocations within SQL Server was causing delays processing queries from SharePoint®. This problem is described in the Microsoft Knowledge Base (#328551).

The fix required creating additional tempdb data files within SQL Server (one for each processor) and enabling a startup parameter (-T1118) that instructs SQL Server to allocate full (uniform) extents instead of mixed extents. With allocations spread across several files, this change reduced resource allocation contention in the tempdb database, improving performance on complex queries.
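As a rough illustration of the one-file-per-processor layout, the sketch below generates `ALTER DATABASE tempdb ADD FILE` statements for review before a DBA runs them. It is not the script used in this engagement: the helper name, the `tempdevN` logical names, the `T:\tempdb` directory and the file sizes are all assumptions. The -T1118 trace flag is a SQL Server startup parameter and is configured separately.

```python
from typing import List


def tempdb_add_file_statements(
    processor_count: int,
    data_dir: str,
    size_mb: int = 1024,
    growth_mb: int = 256,
) -> List[str]:
    """Build T-SQL that adds tempdb data files, one per processor in total.

    All names, paths and sizes here are hypothetical placeholders.
    """
    statements = []
    # tempdb already has one data file by default, so numbering starts at 2.
    for n in range(2, processor_count + 1):
        logical_name = f"tempdev{n}"
        path = f"{data_dir}\\{logical_name}.ndf"
        statements.append(
            "ALTER DATABASE tempdb ADD FILE "
            f"(NAME = N'{logical_name}', FILENAME = N'{path}', "
            f"SIZE = {size_mb}MB, FILEGROWTH = {growth_mb}MB);"
        )
    return statements


# Example for an 8-processor database server with a hypothetical data directory.
for statement in tempdb_add_file_statements(8, r"T:\tempdb"):
    print(statement)
```

Keeping the files the same size matters in practice, because SQL Server distributes allocations across files in proportion to their free space.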

Performance improved but instability continues

After making this change, load tests indicated that the system was able to sustain 15 pages per second at 650 users for 2 hours on a single server. Web page performance had improved, with average page durations down to the 2-4 second range. Specific changes to custom SharePoint® components and some additional database optimizations suggested by Microsoft Support brought average page durations under 1 second.
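The user and throughput figures above can be cross-checked with Little's Law for a closed system, N = X * (R + Z), where N is the number of users, X the throughput, R the page duration and Z the think time between pages. This is a sketch under the assumption that each simulated user loops through request and think phases. The think time is not stated in the article, so the value derived here is an inference rather than a test setting.

```python
def cycle_time_seconds(users: int, pages_per_second: float) -> float:
    """Little's Law rearranged: R + Z = N / X (seconds per page cycle per user)."""
    return users / pages_per_second


def implied_think_time(users: int, pages_per_second: float, page_duration_s: float) -> float:
    """Think time Z left over after subtracting the page duration R."""
    return cycle_time_seconds(users, pages_per_second) - page_duration_s


# Figures from the text: 650 users, 15 pages/sec, page durations of 2-4 seconds.
print(f"cycle time: {cycle_time_seconds(650, 15.0):.1f} s")
for duration in (2.0, 4.0):
    think = implied_think_time(650, 15.0, duration)
    print(f"page duration {duration:.0f} s -> implied think time {think:.1f} s")
```

Under this model each user requests a page roughly every 43 seconds, which is why 650 users translate into only about 15 pages per second.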

Figure 6: Page durations were greatly improved, but performance was not stable

Although we had achieved a fast, stable system on a small subset of pages, the instability re-appeared when we re-introduced the remaining three test cases into the mix. The poor behavior appeared after roughly 80 minutes of operation at load. The failure was not as bad this time, and rather than stalling, the system’s throughput would suddenly drop by 30-50% and then oscillate up and down wildly.

Figure 7: System throughput is good, but then degrades severely and unpredictably

Revisiting SQL Server and the rebooting fix

We now found ourselves wondering whether the SharePoint® Server or the SQL Server was the culprit. We recalled our discovery in previous testing that rebooting the database fixed the problem and brought it to the attention of the Microsoft Support engineers.

We also found that if we stopped the load test when the servers were in a degraded state and restarted within a few minutes, the degradation would continue, even at very low load levels. Further diagnostics around these symptoms revealed that once the system performance had degraded significantly, clearing the query plan cache in SQL Server (via DBCC FREEPROCCACHE) would restore system performance almost immediately. Unfortunately the fix was not permanent, and performance degraded again within a short period of time.

Single-threaded cache access in a multi-processor system

These discoveries led the Microsoft engineers to a Microsoft Knowledge Base article (#927396) that indicated problems with the size of the TokenAndPermUserStore cache in SQL Server. When the server has a large amount of physical memory (in this case 32 GB) and the rate of random dynamic queries is high, the number of entries in this cache grows rapidly. As the cache grows, the time required to traverse and clean up the cache can be substantial. Because access to this cache is single-threaded, queries can pile up behind each other waiting for the cleanup to complete. This queuing slows performance and prevents a multi-processor system from scaling as expected. The remedy was to start SQL Server with a “-T4618” parameter, which limits the TokenAndPermUserStore cache size. (This was not one of the solutions listed in the Microsoft Knowledge Base for this issue; it was provided by a Microsoft Support Engineer.)
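Amdahl's law gives a feel for why a single-threaded section caps multi-processor scaling. The sketch below is a generic model, not a measurement of SQL Server: the serial fractions are hypothetical, and the law describes only the best-case ceiling. It does not capture the extra queuing delay that made this system slower as processors were added.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Best-case speedup when `serial_fraction` of the work cannot run in parallel."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# Hypothetical serial fractions: a small one, and a large one standing in for
# time spent waiting on a single-threaded cache cleanup.
for fraction in (0.05, 0.5):
    speedup = amdahl_speedup(fraction, 8)
    print(f"serial fraction {fraction:.2f}: at most {speedup:.2f}x on 8 processors")
```

With half of each query's time serialized, eight processors can deliver less than a 2x speedup, so the remaining processors sit mostly idle, which fits the low CPU utilization observed during the stalls.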

Security Token Cache Size bug in SharePoint®

After the cache-limit fix was applied to SQL Server, the next load test of the system showed steady performance with 15 pages/sec and APDs under 1 second, supporting 650 concurrent users for 10 hours. However, in a subsequent load test, errors reading “Arithmetic operation resulted in an overflow.” started appearing in the pages, indicating that SharePoint® was unable to render many web parts on the page. Microsoft quickly traced this to a bug in a SharePoint® cache implementation that was fixed by reducing the SharePoint® Security Token Cache size. Apparently the object cache throws integer overflow exceptions when the cache size is greater than 2000.

With the above fix applied and tested, the system was ready for a longer stress test to judge the stability of the system over longer periods. The next load test ran for 48 hours at 650 users. The system performed well – easily satisfying the performance requirement with only a single SharePoint® web server. No degradation of performance was observed. Further testing with all three SharePoint® servers and higher load levels showed similar success.

Figure 8: A successful 48-hour test at 650 users

Final Results

Before stress testing and tuning, the website could handle only 100 users (4 pages/sec). With the improvements, it handled 2000 users (45 pages/sec, nearly 800 hits/sec) with low CPU utilization (about 20%) on the servers. For reference, if sustained for an entire day, this rate would amount to nearly 3.9 million pages served per day.
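The daily figure follows directly from the sustained page rate. A short check of the arithmetic, with a helper name chosen here only for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_per_day(pages_per_second: float) -> int:
    """Pages served in 24 hours at a constant page rate."""
    return round(pages_per_second * SECONDS_PER_DAY)


print(f"{pages_per_day(45):,} pages per day")  # 3,888,000
```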

At 2000 users, CPU utilization of the servers is below 25% – the customer’s Internet connection is now the only factor limiting total capacity. With a higher bandwidth connection, it is possible that this site could now service up to 8000 users.
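The 8000-user estimate is consistent with a simple linear extrapolation of CPU headroom, sketched below. This assumes CPU utilization scales in proportion to user load and that nothing else (bandwidth, locks, caches) saturates first. Earlier tests in this investigation showed how easily that assumption can fail, so the result is an upper-bound guess, not a prediction. The function name is hypothetical.

```python
def projected_user_capacity(users: int, cpu_utilization: float) -> int:
    """Users supportable at 100% CPU if utilization grows linearly with load."""
    if not 0.0 < cpu_utilization <= 1.0:
        raise ValueError("cpu_utilization must be in the range (0, 1]")
    return int(users / cpu_utilization)


# From the text: 2000 users with CPU utilization below 25%.
print(projected_user_capacity(2000, 0.25))  # 8000
```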

Figure 9: 2000 user test shows high throughput and steady performance

Figure 10: 2000 user test shows low page durations

Figure 11: Servers show low utilization at 2000 users


View Comments to “Load Testing Reveals Cause of SharePoint Server Performance Problem”

  1. Amazing article!!!!!!
    You mention reducing the number of CPUs to 2. Did you do this through a hypervisor or through some other means? I don't think you mentioned whether the servers were running in a virtual environment.

  2. Chris, thank you for that excellent article. I really enjoyed reading how you methodically and logically eliminated all possible scenarios to resolve the problem.
