Sporadic BGP issues can be difficult to troubleshoot, especially when the routing problem is further upstream (ISP controlled) and out of your hands. Part of the difficulty is that Cisco routers log only BGP neighbor state changes by default, not route update or route withdrawal messages. A default log entry looks like this:
ADJCHANGE: neighbor x.x.x.x down
Consider the following network and the issue described below:
Customer Site 1:
R1 – AS 65505 – 10.17.184.0/24
R2 (ISP CPE)
Customer Site 2:
R4 – AS 65502 – 10.23.28.0/24
R5 (ISP CPE)
ISP PE:
R3 – AS 2202
Real-time traffic is dropping somewhere between R1 and R4, as evidenced by IPsec tunnel drops and VoIP call drops. The drops occur at random intervals. After ruling out the other usual suspects (link saturation, internal network issues, latency, etc.), the next step is to look at the upstream routing.
- First, we check the neighbor state between R1 and R2:
R2#sh ip bgp neighbors
BGP neighbor is 10.0.23.210, remote AS 64627, external link
 Description: Level 3
  BGP version 4, remote router ID 198.19.171.198
  BGP state = Established, up for 1y10w
  Last read 00:00:22, last write 00:00:13, hold time is 180, keepalive interval is 60 seconds
  Configured hold time is 24, keepalive interval is 8 seconds
  Minimum holdtime from neighbor is 24 seconds
  Neighbor sessions:
    1 active, is not multisession capable
  Session: 10.0.23.210
- The BGP session between R1 and R2 has been established for over a year. However, that tells us nothing about recent route changes.
Next, let’s take a look at the route for the site we are interested in (10.17.168.0):
R2#sh ip route 10.17.168.0
Routing entry for 10.17.168.0/24
  Known via "bgp 65505", distance 20, metric 0
  Tag 64627, type external
  Redistributing via eigrp 10
  Advertised by eigrp 10 metric 100 1 255 1 1500
  Last update from 10.0.23.210 5d02h ago
  Routing Descriptor Blocks:
  * 10.0.23.210, from 10.0.23.210, 5d00h ago
- The last update for that route (an update being either a withdrawal or an advertisement) was 5 days ago, which indicates that something changed upstream at that time. Looking at the rest of the routing table confirms it.
R2#sh ip route | include 5d02h
B 10.0.23.32/28 [20/0] via 10.0.23.202, 5d02h
B 10.0.23.208/30 [20/0] via 10.0.23.202, 5d02h
B 10.0.137.144/28 [20/0] via 10.0.23.202, 5d02h
B 10.1.17.37/32 [20/0] via 10.0.23.202, 5d02h
B 10.1.17.38/32 [20/0] via 10.0.23.202, 5d02h
B 10.17.175.16/28 [20/0] via 10.0.23.202, 5d02h
B 10.17.183.0/24 [20/0] via 10.0.23.202, 5d02h
B 10.17.184.0/24 [20/0] via 10.0.23.202, 5d02h
B 10.17.185.0/24 [20/0] via 10.0.23.202, 5d02h
B 10.17.186.0/24 [20/0] via 10.0.23.202, 5d02h
B 10.17.187.0/24 [20/0] via 10.0.23.202, 5d02h
- Reading this table, we can see that something happened 5 days ago that caused all routes from our ISP peer (R3) to be withdrawn or updated at the same moment. The peer withdrew all the routes, which points to an issue upstream within its service cloud. Unfortunately, this is very easy to miss if you don’t read carefully.
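The manual scan described above can be approximated with a short script. What follows is a minimal Python sketch and not part of the original troubleshooting session. It assumes the output of "sh ip route" has been saved as text in the same shape as the listing in this post, and the names ROUTE_RE, routes_by_age and bulk_updates are hypothetical helpers invented for illustration. It groups BGP routes by their age column so that a large batch sharing one age stands out.

```python
import re

# Assumed line shape, copied from the output shown in this post:
#   B 10.17.184.0/24 [20/0] via 10.0.23.202, 5d02h
# Other IOS versions may format these lines differently.
ROUTE_RE = re.compile(
    r"^B\*?\s+(?P<prefix>\d+\.\d+\.\d+\.\d+(?:/\d+)?)\s+"
    r"\[\d+/\d+\]\s+via\s+\d+\.\d+\.\d+\.\d+,\s+(?P<age>[0-9a-z:]+)"
)


def routes_by_age(show_ip_route_text):
    """Map each age string (e.g. '5d02h') to the BGP prefixes that carry it."""
    groups = {}
    for line in show_ip_route_text.splitlines():
        match = ROUTE_RE.match(line.strip())
        if match:
            groups.setdefault(match.group("age"), []).append(match.group("prefix"))
    return groups


def bulk_updates(groups, min_routes=3):
    """Return (age, route_count) pairs for ages shared by at least
    min_routes routes, largest group first."""
    hits = [
        (age, len(prefixes))
        for age, prefixes in groups.items()
        if len(prefixes) >= min_routes
    ]
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample = """\
B 10.17.183.0/24 [20/0] via 10.0.23.202, 5d02h
B 10.17.184.0/24 [20/0] via 10.0.23.202, 5d02h
B 10.17.185.0/24 [20/0] via 10.0.23.202, 5d02h
"""
    for age, count in bulk_updates(routes_by_age(sample)):
        print(f"{count} BGP routes last changed {age} ago")
```

The min_routes threshold is an arbitrary choice for this sketch. On a real edge router you would tune it to the size of the table, since a handful of routes sharing an age is normal while most of the table sharing one age is not.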
This is a very basic way to troubleshoot BGP route changes when you have no other logging available. To catch the next occurrence as it happens, run "debug ip bgp updates" on your edge routers when you see problems like this. Doing so produces DEBUG-level events in your console and syslog that spell out the issue, which you can then pass on to the problem peer. During an update, the output on your peer router looks like this:
2016-07-30 02:44:29 Local6.Debug 10.0.23.209 72990: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.188/30 down
2016-07-30 02:44:29 Local6.Debug 10.0.23.209 72991: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.192/30 down
2016-07-30 02:44:29 Local6.Debug 10.0.23.209 72992: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.200/30 down
2016-07-30 02:44:29 Local6.Debug 10.0.23.209 72993: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.23.192.0/30 down
2016-07-30 02:44:29 Local6.Debug 10.0.23.209 72994: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.23.192.4/30 down
2016-07-30 02:44:29 Local6.Debug 10.0.23.209 72995: chip01rtr2: Jul 30 09:44:28.987 GMT: BGP(0): route 10.23.192.32/29 down
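Once these messages land in syslog, they can be summarized before you send them to the peer. The following is a rough Python sketch under stated assumptions: the log line layout is taken only from the sample above (other syslog servers or IOS timestamp settings will differ), and DOWN_RE and withdrawals_by_second are hypothetical names made up for this example. It collects every "route ... down" message and groups the prefixes by the router's own timestamp, truncated to the second.

```python
import re
from collections import defaultdict

# Assumed message shape, based on the sample in this post:
#   ... Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.188/30 down
DOWN_RE = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})(?:\.\d+)?\s+\S+:\s+"
    r"BGP\(\d+\):\s+route\s+(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)\s+down"
)


def withdrawals_by_second(log_text):
    """Group prefixes reported as 'down' by the router timestamp (to the second).

    finditer is used instead of line-by-line matching so the function also
    copes with logs whose line breaks were lost in copy and paste.
    """
    events = defaultdict(list)
    for match in DOWN_RE.finditer(log_text):
        events[match.group("stamp")].append(match.group("prefix"))
    return dict(events)


if __name__ == "__main__":
    sample = (
        "2016-07-30 02:44:29 Local6.Debug 10.0.23.209 72990: chip01rtr2: "
        "Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.188/30 down\n"
        "2016-07-30 02:44:29 Local6.Debug 10.0.23.209 72991: chip01rtr2: "
        "Jul 30 09:44:28.987 GMT: BGP(0): route 10.0.23.192/30 down\n"
    )
    for stamp, prefixes in withdrawals_by_second(sample).items():
        print(f"{stamp}: {len(prefixes)} routes down: {', '.join(prefixes)}")
```

A summary such as "6 routes down at Jul 30 09:44:28" with the prefix list attached is usually easier for an ISP to act on than a raw debug dump.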
Reference:
http://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/22166-b-trouble-main.html