
Hi Fellow Network Enthusiasts!

I'm having a hard time troubleshooting a real-time stock trading application. Here is the background and what I've done so far:

Apparently, this is the only real-time application in the company I work at that's experiencing slowness, and everyone really likes to blame the network guy (me) whenever an issue like that happens.

Performed traceroutes from App to DB, Client to App, and DB to App; the traceroutes average 4 ms to 9 ms.

Monitored CPU utilization, memory usage, packet loss, and link utilization in SolarWinds.

The Maximum Stats on SOME network devices are:
CPU Utilization: 30%
Memory Usage: 73%
Packet Loss 0%
Link Utilization: 60%

This doesn't seem helpful except as a baseline, since everything looks okay, yet they're still experiencing slowness.

I have also captured traffic with Wireshark. At first I sniffed only one device at a time, which didn't give me enough detail, so I sniffed two devices at once: the switch closest to the trade application server and the switch closest to the database server. Comparing their I/O graphs, I found that the App-to-DB packet delay is the same on both switches, but the DB-to-App latency differs. The switch closest to the DB shows 10 ms or less, while the switch closest to the trade server shows a delay of 100 to 254 ms!

One problem I'm having with Wireshark, though, is that all our switches (except those that connect to the client's PC) perform load balancing, so it's hard to predict which path a given packet takes. Also, SQL and the FIX protocol are not fully supported by Wireshark yet, so it's hard to inspect them. I'm currently just looking at the delta times. However, I don't know how to tell whether a high delta time is caused by the application response time (ART) or by delay in network transmission. The server guys don't know how to check the ART -_-'
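For illustration, the distinction between application response time and network delay can be sketched as below. This is only a rough sketch under assumptions: the timestamps are made up, the same request and response are assumed to have been matched at both capture points (for example by TCP sequence number), and both capture clocks are assumed to be synchronized. The names `Sighting` and `split_delay` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sighting:
    """When one request/response pair passed a single capture point (seconds)."""
    request_seen: float   # App -> DB request observed here
    response_seen: float  # DB -> App response observed here


def split_delay(app_side: Sighting, db_side: Sighting) -> dict:
    """Split the round trip seen near the app server into three parts.

    Assumes both captures use synchronized clocks. The server_response part
    also includes the short hop between the DB-side switch and the DB itself.
    """
    return {
        "network_app_to_db": db_side.request_seen - app_side.request_seen,
        "server_response": db_side.response_seen - db_side.request_seen,
        "network_db_to_app": app_side.response_seen - db_side.response_seen,
    }


# Made-up example values, not measurements from this network.
app_side = Sighting(request_seen=0.000, response_seen=0.120)
db_side = Sighting(request_seen=0.004, response_seen=0.012)

for part, seconds in split_delay(app_side, db_side).items():
    print(f"{part}: {seconds * 1000:.1f} ms")
```

With numbers like these, most of the round trip would be attributed to the DB-to-App network leg rather than to the server, but that conclusion only holds if the clocks really are in sync.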

So I'm currently investigating the very high delay. My problem is that there are a lot of network devices between the App, the DB, and the client's PC: around 30 devices, including the two firewalls. I'm planning on sniffing at least all the devices between App and DB, excluding the firewalls, and that would take six laptops running at the same time.

Before I perform that though, I have a few questions:

  1. Am I doing the troubleshooting process correctly?

  2. Are there other things I could look at/check?

  3. Does the number of network hops increase the latency? If yes, does it vary between switches, routers, and firewalls?

  4. The vendor for the trading application recommends implementing QoS for this trading application. Is that really necessary?

Thank you in advance for all your help and sorry for the long block of text. :)

Mike Pennington
Rufi
  • Whenever I hear large deltas in delays depending on direction over symmetric paths, checking for duplex/speed mismatches on links might be worthwhile. – generalnetworkerror Aug 09 '14 at 02:06
  • This looks like the same problem as your newest question, so I marked this one as a duplicate of it. If I am mistaken, please let me know. If I am correct, please merge any missing information from this one into your new question. – Mike Pennington Aug 14 '14 at 08:01

1 Answer


It's hard to tell for sure without more info, but the fact that the capture close to the DB looks fine while the one farther away is slow does suggest network congestion. Also, 60% link utilization is high considering it is usually averaged over a 5-minute period. You could very easily be saturating some links for short periods. But in answer to your questions:
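To see why a 60% average can hide saturation, here is a small sketch with made-up numbers. It assumes per-second utilization samples over one 5-minute window in which the link is completely full half the time and lightly loaded otherwise; real traffic will not look this tidy.

```python
# Hypothetical per-second link utilization (%) over a 5-minute (300 s) window.
# Assumption: the link is at 100% for 150 s in total and at 20% for the rest.
samples = [100.0] * 150 + [20.0] * 150

average = sum(samples) / len(samples)
saturated_seconds = sum(1 for s in samples if s >= 100.0)

print(f"5-minute average: {average:.0f}%")
print(f"seconds at 100%:  {saturated_seconds}")
```

The monitoring graph would report 60% for this window, even though packets were being queued or dropped for half of it.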

  1. I don't think the extra sniffing will tell you anything more. You are already seeing the delay. Check your captures to see if you are experiencing retransmissions. Also check for drops on each individual link. Oh, and since you have a complicated network, you should spend the time figuring out the data paths. That will be necessary for troubleshooting.

  2. If you have Cisco devices, you can change the port averaging time to 30 seconds (on IOS, the load-interval 30 interface command), which may give you a better sense of your link utilization. You can also look at individual ports for dropped packets. If you have it, NetFlow might be helpful too.

  3. Assuming you have commercial-grade switches, the extra latency caused by them will be negligible.

  4. If you do in fact have network congestion, then yes, QoS will help out by prioritizing that traffic over all the cat videos everyone is watching.
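As a rough back-of-the-envelope check on point 3, the sketch below estimates the per-hop serialization delay. The assumptions are mine, not from the question: 1 Gbps links, full-size 1500-byte frames, store-and-forward switching at every one of the roughly 30 devices, and no queueing at all. The function name is hypothetical.

```python
def serialization_delay_us(frame_bytes: int, link_bps: float) -> float:
    """Time to clock one frame onto the wire, in microseconds."""
    return frame_bytes * 8 / link_bps * 1e6


# Assumed values: 1500-byte frames on 1 Gbps links, 30 store-and-forward hops.
per_hop_us = serialization_delay_us(1500, 1e9)
hops = 30
total_ms = per_hop_us * hops / 1000

print(f"per hop: {per_hop_us:.0f} us")
print(f"{hops} hops: {total_ms:.2f} ms")
```

Under these assumptions the hops themselves add well under a millisecond, which is far too little to explain a 100 ms or larger delay. Queueing on a congested link is the part this estimate leaves out, and that is where QoS would matter.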

Ron Trunk