[GIS] Issue with recently cached ArcGIS services

arcgis-servercache

ArcGIS for Server 10.0 SP5

Virtual Windows Server 2008 R2 Enterprise

16.0 GB RAM, 4 Processors (2.00 GHz each)

I am having issues with two services that I have cached as basemaps for my web applications.

Here's a little back story: these services have been in use for a couple of years, and roughly every six months the caches are rebuilt to pick up changes in basemap items (e.g., street centerline edits). We use a variety of basemaps in our web map applications, and this past weekend I updated the caches on two of them, since rebuilding takes a long time in our current setup (a single-machine environment).

These services worked just fine before the rebuild, and after recaching they still appear fine at first, until a user spends a little more than two minutes in an application that uses them. Then the basemap tiles begin to lag and fail to load. Network debugging shows the requests sitting in a pending state: sometimes the tiles do load after about 1-2 minutes, most of the time they do not load for at least 4 minutes, and in the remaining cases they never load at all, staying pending until they time out with a 500 server error.

I have an open ticket with Esri, but they have yet to pinpoint the issue. Some changes we have tried include:

  • Editing the impersonation setting in the rest and services config files from true to false
  • Re-running the post installation
  • Changing the services to low isolation and increasing the instances to 8

Results for each change, in the same order:

  • Caused 404 errors on the REST services page.
  • This was done to try to fix the 404 error above; it did not. The impersonate values were changed back to true to resolve the 404 error.
  • No change in the services; same lag issues as before.

In addition to the above, the w3wp.exe process for the ArcGISServiceAppPool periodically pegs the server's CPU at 100%. I have seen the KB articles related to this issue (one for w3wp.exe, although it targets 9.x versions, and one for lsass.exe). Implementing their suggested fixes is what caused the 404 issue.

I read through this post and it relates to some of the issues I am also experiencing. I have checked the log and do not see anything that helps identify the issue at hand.

This question is also posted to the ArcGIS forum. A user there asked whether the service caches were being created on demand. They are not; however, I have tested both with and without that option and the results are similar.


I believe I may have found why the issue occurs only at certain levels: it appears that some tiles are missing at those levels. When users pan or zoom into the missing areas, the browser issues tile requests that never complete. If enough of these pile up, they bog the server down (and since most other requests are fulfilled, the problem is hard to spot in performance monitoring).

I think from here I am going to recache the map at the last three levels, then copy the updated levels to the original cache and see if this fixes the issue.

Best Answer

It appears my assumption was correct. This is what I did to fix the issue:

  1. Created a copy of the service
  2. Copied the cache from the original service to the new one
  3. Deleted the cache at the levels where tiles appeared to be missing (in my case, the last three levels)
  4. Recreated all tiles at the levels I deleted (steps 3 and 4 can be scripted; see the sketch after this list)
  5. Copied the levels from the new service back to the original
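
For anyone who wants to script steps 3 and 4, here is a rough sketch using arcpy's Manage Map Server Cache Tiles tool. Note that this shows the 10.1+ form, which takes a service path; the 10.0 tool takes server/object-name parameters instead, so check the tool reference for your release. The service path, scale values, and instance count are placeholders for your own values.

```python
# Rough sketch of steps 3 and 4 (10.1+ tool signature shown; 10.0 differs).
# The service path, scales, and instance count are placeholders.
import arcpy

service = r"GIS Servers\arcgis on myserver (admin)\Basemap.MapServer"  # hypothetical
scales = "9027.977411;4513.988705;2256.994353"  # the three problem levels
instances = 4  # number of caching service instances to use

# Step 3: drop whatever is left at the affected levels.
arcpy.ManageMapServerCacheTiles_server(service, scales, "DELETE_TILES", instances)

# Step 4: rebuild only the tiles that are absent at those levels.
arcpy.ManageMapServerCacheTiles_server(service, scales, "RECREATE_EMPTY_TILES", instances)
```

RECREATE_EMPTY_TILES only builds tiles that do not already exist, so after the delete it fills the levels back in completely; RECREATE_ALL_TILES should also work if you would rather skip the delete step.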

What I don't understand is why no errors were thrown when these tiles were not created originally. The geoprocessing job completed without error. It should throw an error, or at least a warning, if a tile cannot be created.

It would also be nice if a Python script could identify where the missing tiles are, reporting back their scale level and ID, so that users could recreate tiles only at the levels that have issues. I had to blindly work my way through the entire basemap, taking notes on which scales had problems.
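
As a starting point for such a script, here is a rough sketch. It assumes an exploded cache laid out as L<level>\R<hex row>\C<hex col>.png under the service's _alllayers folder; a compact (bundle) cache would need different handling, and the cache path below is a placeholder. It only flags gaps inside the bounding row/column range that actually contains tiles, so irregularly shaped caches will report some false positives.

```python
# Rough sketch: report apparent gaps in an exploded map cache.
# Assumes the _alllayers layout: L<level>\R<hex row>\C<hex col>.<png|jpg>.
# Compact (bundle) caches are not handled; the path is a placeholder.
import os
import re

CACHE_DIR = r"C:\arcgisserver\arcgiscache\Basemap\Layers\_alllayers"  # hypothetical path

row_dir = re.compile(r"^R([0-9a-fA-F]{8})$")
col_file = re.compile(r"^C([0-9a-fA-F]{8})\.(png|jpg|jpeg)$", re.IGNORECASE)

for level in sorted(os.listdir(CACHE_DIR)):
    level_path = os.path.join(CACHE_DIR, level)
    if not (level.startswith("L") and os.path.isdir(level_path)):
        continue

    # Collect the (row, col) pairs that actually exist at this level.
    tiles = set()
    for row_name in os.listdir(level_path):
        m = row_dir.match(row_name)
        if not m:
            continue
        row = int(m.group(1), 16)
        for col_name in os.listdir(os.path.join(level_path, row_name)):
            c = col_file.match(col_name)
            if c:
                tiles.add((row, int(c.group(1), 16)))

    if not tiles:
        print("{0}: no tiles found".format(level))
        continue

    # Flag empty positions inside the bounding row/col range of this level.
    rows = [r for r, _ in tiles]
    cols = [c for _, c in tiles]
    expected = (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)
    missing = expected - len(tiles)
    print("{0}: {1} tiles, {2} gaps within bounding range".format(level, len(tiles), missing))
```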

All of that being said, I am still not sure whether this will fix the 100% CPU usage caused by the w3wp.exe process for the ArcGISServiceAppPool. The KB article that suggests changing the impersonate values is not a straightforward fix, nor did it correct the issues we are experiencing. Our development team suggested scheduling the app pool to recycle at a specific time; I tried this, but it doesn't seem to have fixed the issue.
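
For reference, the recycle schedule itself can be scripted. The sketch below assumes IIS 7.x's appcmd.exe (standard on Windows Server 2008 R2) and simply wraps the command in Python; the pool name comes from this post, and the 3:00 AM time is an arbitrary example. It needs to be run from an elevated prompt.

```python
# Rough sketch: add a fixed-time recycle for the ArcGISServiceAppPool IIS
# application pool via appcmd.exe (IIS 7.x). The pool name comes from this
# post; the 03:00 time is an arbitrary example. Run as administrator.
import subprocess

APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"
POOL = "ArcGISServiceAppPool"
RECYCLE_AT = "03:00:00"  # hh:mm:ss, local server time

# Appends an entry to the pool's periodic-restart schedule collection.
subprocess.check_call([
    APPCMD, "set", "apppool", POOL,
    "/+recycling.periodicRestart.schedule.[value='{0}']".format(RECYCLE_AT),
])
```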

I know this post is really about two separate issues, but I have a feeling they may be related, since the only time we seem to experience the CPU issue is when the services are also having problems. Hopefully the steps above will help others who run into similar issues.