BGP Troubleshooting with Cisco Routers – Determining Upstream ISP Fault

Sporadic BGP issues can be difficult to troubleshoot, especially when the routing problems are further upstream, inside the ISP’s network and out of your control. Part of the difficulty is that Cisco routers log only BGP neighbor state changes by default, not route update or route withdrawal messages. A typical default log entry looks like this:

ADJCHANGE: neighbor x.x.x.x down

Consider the following network and the issue described below:

Customer Site 1:

R1 – AS 65505 – 10.17.184.0/24

R2 (ISP CPE)

Customer Site 2:

R4 – AS 65502 – 10.23.28.0/24
R5 (ISP CPE)

ISP PE:
R3 – AS 2202

Real-time traffic is dropping somewhere between R1 and R4, as evidenced by IPsec tunnels going down and VoIP calls dropping. The drops occur at random intervals. After ruling out the other usual suspects (link saturation, internal network issues, latency, etc.), the next step is to look at the upstream routing.

First, we check the BGP neighbor state between R1 and R2:

R2#sh ip bgp neighbors

BGP neighbor is 10.0.23.210,  remote AS 64627, external link

Description: Level 3

BGP version 4, remote router ID 198.19.171.198

BGP state = Established, up for 1y10w

Last read 00:00:22, last write 00:00:13, hold time is 180, keepalive interval is 60 seconds

Configured hold time is 24, keepalive interval is 8 seconds

Minimum holdtime from neighbor is 24 seconds

Neighbor sessions:

1 active, is not multisession capable

Session: 10.0.23.210

The BGP session between R1 and R2 has been established for over a year. However, that tells us nothing about recent route changes.

Next, let’s take a look at the route for the site we are interested in (10.17.168.0/24):

R2#sh ip route 10.17.168.0
Routing entry for 10.17.168.0/24
Known via "bgp 65505", distance 20, metric 0
Tag 64627, type external
Redistributing via eigrp 10
Advertised by eigrp 10 metric 100 1 255 1 1500
Last update from 10.0.23.210 5d02h ago
Routing Descriptor Blocks:
* 10.0.23.210, from 10.0.23.210, 5d00h ago

The last update for that route (an update being either a withdrawal or an advertisement) was 5 days ago, even though the BGP session itself has been up for over a year. This indicates that something changed upstream. The rest of the routing table confirms it:

R2#sh ip route | include 5d02h
B        10.0.23.32/28 [20/0] via 10.0.23.202, 5d02h
B        10.0.23.208/30 [20/0] via 10.0.23.202, 5d02h
B        10.0.137.144/28 [20/0] via 10.0.23.202, 5d02h
B        10.1.17.37/32 [20/0] via 10.0.23.202, 5d02h
B        10.1.17.38/32 [20/0] via 10.0.23.202, 5d02h
B        10.17.175.16/28 [20/0] via 10.0.23.202, 5d02h
B        10.17.183.0/24 [20/0] via 10.0.23.202, 5d02h
B        10.17.184.0/24 [20/0] via 10.0.23.202, 5d02h
B        10.17.185.0/24 [20/0] via 10.0.23.202, 5d02h
B        10.17.186.0/24 [20/0] via 10.0.23.202, 5d02h
B        10.17.187.0/24 [20/0] via 10.0.23.202, 5d02h

Reading this table, we can see that something happened 5 days ago that caused our ISP peer (R3) to withdraw or update all of its routes at the same time. The most likely explanation is an issue upstream within the ISP’s service cloud. Unfortunately, this is very easy to miss if you don’t read the output carefully.
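Spotting a shared route age by eye gets harder as the table grows. As a rough sketch, the Python below groups BGP routes from saved "show ip route" output by next hop and age, so a bulk update stands out. The function names (routes_by_age, bulk_updates) and the threshold parameter are made up for illustration, and the regular expression assumes lines shaped like the ones shown above; other IOS versions or output variants may need adjustments.

```python
import re
from collections import Counter

# Assumed line shape, taken from the output shown above:
# B        10.17.184.0/24 [20/0] via 10.0.23.202, 5d02h
BGP_ROUTE = re.compile(
    r"^B\S*\s+(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\s+\[\d+/\d+\]\s+"
    r"via\s+(?P<next_hop>\d+\.\d+\.\d+\.\d+),\s+(?P<age>[^\s,]+)"
)


def routes_by_age(show_ip_route_text):
    """Count BGP routes per (next hop, age) pair."""
    counts = Counter()
    for line in show_ip_route_text.splitlines():
        match = BGP_ROUTE.match(line.strip())
        if match:
            counts[(match["next_hop"], match["age"])] += 1
    return counts


def bulk_updates(show_ip_route_text, threshold=5):
    """Return (next_hop, age, count) groups with at least `threshold` routes."""
    return [
        (next_hop, age, count)
        for (next_hop, age), count in routes_by_age(show_ip_route_text).most_common()
        if count >= threshold
    ]


if __name__ == "__main__":
    sample = """\
B        10.17.183.0/24 [20/0] via 10.0.23.202, 5d02h
B        10.17.184.0/24 [20/0] via 10.0.23.202, 5d02h
B        10.17.185.0/24 [20/0] via 10.0.23.202, 5d02h
"""
    for next_hop, age, count in bulk_updates(sample, threshold=3):
        print(f"{count} routes via {next_hop} last changed {age} ago")
```

A large group of routes from one next hop sharing the same age is only a hint that the peer re-advertised or withdrew them together. It does not say what the peer did or why.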

Comparing route ages like this is a very basic way to troubleshoot BGP route changes when you have no other logging available. To avoid ending up in that position, run "debug ip bgp updates" on your edge routers when you see problems like this. Doing so produces debug-level events on your console and in syslog that spell out the issue, which you can then pass on to the problem peer. During an update from your peer router, the output looks like this:

2016-07-30 02:44:29    Local6.Debug    10.0.23.209    72990: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.188/30 down
2016-07-30 02:44:29    Local6.Debug    10.0.23.209    72991: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.192/30 down
2016-07-30 02:44:29    Local6.Debug    10.0.23.209    72992: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.200/30 down
2016-07-30 02:44:29    Local6.Debug    10.0.23.209    72993: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.23.192.0/30 down
2016-07-30 02:44:29    Local6.Debug    10.0.23.209    72994: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.23.192.4/30 down
2016-07-30 02:44:29    Local6.Debug    10.0.23.209    72995: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.23.192.32/29 down
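Once these debug messages are in syslog, they can be summarized per timestamp to show which prefixes went down together. The following is a minimal Python sketch, assuming log lines formatted exactly like the sample above. The name withdrawals_by_time is hypothetical, and the timestamp pattern depends on your "service timestamps" configuration and on how your syslog server formats entries.

```python
import re
from collections import defaultdict

# Assumed message shape, taken from the sample above:
# ... Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.188/30 down
ROUTE_DOWN = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\.\d+\s+\w+):\s+"
    r"BGP\(\d+\):\s+route\s+(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\s+down"
)


def withdrawals_by_time(log_lines):
    """Map each router timestamp to the prefixes reported down at that time."""
    events = defaultdict(list)
    for line in log_lines:
        match = ROUTE_DOWN.search(line)
        if match:
            events[match["stamp"]].append(match["prefix"])
    return dict(events)


if __name__ == "__main__":
    lines = [
        "2016-07-30 02:44:29    Local6.Debug    10.0.23.209    72990: chip01rtr2: "
        "Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.188/30 down",
        "2016-07-30 02:44:29    Local6.Debug    10.0.23.209    72991: chip01rtr2: "
        "Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.192/30 down",
    ]
    for stamp, prefixes in withdrawals_by_time(lines).items():
        print(f"{stamp}: {len(prefixes)} routes down: {', '.join(prefixes)}")
```

Many prefixes going down at the same instant points to a single upstream event rather than unrelated flaps, which is useful evidence to hand to the ISP.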

Reference:

http://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/22166-b-trouble-main.html
