- Priority - Critical
- Affecting Other - Magnavoip
Between 2:30pm and 3:00pm on 10/7/2017, Data102 observed multiple T1/PRIs lose connectivity to an upstream voice carrier.
These circuits carry the majority of traffic for the Colorado Springs/719 region of our service. Customers with service via this carrier have been unable to make or receive calls for the duration of the outage.
As of 4:30pm, we have remapped all of our outbound dialing to avoid these circuits; inbound service continues to be impacted.
** Update 4:40pm **
We have learned that our upstream carrier suffered a power outage due to vandalism at their head end. Power has been restored, and services are being repaired.
** Update 6:05pm **
We were notified of inbound service restoration, which was confirmed at approximately 6:10pm.
** Update 6:43pm **
We were notified of outbound service restoration; trunks were reconfigured and outbound service was tested at 7:38pm. (This was not urgent, as outbound calls were being processed through alternate upstreams without issue.)
** Update 10/9/17 @ 11:30am **
We are aware of intermittent inbound dialing failures, and of an "All circuits are busy" message occasionally playing on connect. This is due to various underlying circuits on our trunk group being down, with some calls attempting to use those downed services. We have remedied more than 50% of those circuits and are troubleshooting the remaining capacity.
** Update 10/9/17 @ 12:32pm **
Technicians are en route to the far end of one of our down hi-cap circuits to replace equipment that was damaged during the outage. Upon equipment replacement and cable swapping, we expect full operation to be restored.
** Update 10/9/17 @ 2:16pm **
Full service has been restored across all uplinks and carriers.
- Date - 10/07/2017 14:29 - 03/21/2018 22:23
- Last Updated - 10/09/2017 14:21
- Priority - Low
- Affecting System - DirectMX
We have become aware that the DirectMX platform has failed to update anti-spam settings for at least 24 hours, resulting in a slight increase in spam being delivered to end users. We are aware of this issue and are actively investigating.
This issue has been isolated to a hung process and has been resolved; definition updates have completed.
- Date - 12/09/2016 11:41 - 08/09/2017 11:59
- Last Updated - 12/09/2016 12:21
- Priority - Medium
- Affecting Server - portal.magnavoip.com
Earlier this morning Data102 became aware of intermittent outbound calling problems, where end users would place a call and receive a message indicating "Destination Unavailable" or similar. Inbound calls were not impacted.
Issue Root Cause
After debugging, Data102 determined that an upstream carrier's voice gateway became unavailable and unresponsive at an unknown time in the morning. Due to this failure, an alternate service was used for call completions. This failover service has a limited number of call paths available (46), and intermittently this alternate service became congested, resulting in a failure to complete outbound calls.
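The failure mode above is the classic trunk-group congestion problem: a failover path with only 46 call paths will block new calls once offered traffic approaches that capacity. As a rough illustration (not Data102's actual tooling, and the traffic figures below are hypothetical), the blocking probability for a trunk group can be estimated with the standard Erlang B recurrence:

```python
def erlang_b(offered_erlangs: float, trunks: int) -> float:
    """Probability a new call is blocked on a trunk group (Erlang B).

    Uses the standard iterative recurrence:
        B(0) = 1;  B(k) = E*B(k-1) / (k + E*B(k-1))
    Illustrative sketch only; the traffic loads used below are
    hypothetical, not measured values from this incident.
    """
    b = 1.0
    for k in range(1, trunks + 1):
        b = offered_erlangs * b / (k + offered_erlangs * b)
    return b

# With 46 call paths, blocking rises sharply as offered load nears capacity:
light_load = erlang_b(30.0, 46)   # modest blocking
heavy_load = erlang_b(44.0, 46)   # noticeably congested
```

This is why a failover path sized well below the primary's normal load produces exactly the intermittent "unable to complete" symptoms described: most calls succeed, but a growing fraction are blocked as concurrent call volume peaks.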
After determining that the main voice gateway was unavailable, Data102 reconfigured the call platform to forward calls to an alternate voice gateway. This promptly resolved the call flow issue, and service was restored. Data102 has put additional monitoring of upstream service provider availability in place, and has set up additional alarming and auto-redirection services in case of gateway failure.
- Date - 10/12/2016 09:29 - 10/12/2016 10:12
- Last Updated - 10/12/2016 14:55
- Priority - Critical
- Affecting Server - sip.data102.com
- Outage Scope: Customers using MagnaVoIP phone services behind NAT firewalls without ALG
- Outage Date: 2016-07-11
- Outage Type: One-way audio / loss of service
- Severity: Critical
- Root Cause: Bug causing incorrect NAT setting on customer-facing SIP proxy
On 2016-07-10 @ 5:30am, Data102 completed a software upgrade of the MagnaVoIP platform, and throughout the day monitored the platform and confirmed numerous successful call completions.
At approximately 8:05am 2016-07-11, multiple reports indicated that customers were experiencing one-way audio. Immediate troubleshooting began, and at 8:48am, vendor support was engaged.
The vendor confirmed that the issue existed as described, but did not have a fix or a root cause.
At approximately 1:30pm, the decision was made to roll back the previously performed upgrade. The rollback began at ~2:15pm, and all service was confirmed restored by 2:22pm.
Customers whose SIP endpoint sat behind a firewall or similar device performing NAT, without a SIP Application Layer Gateway, and who were registered to sip1.magnavoip.com were impacted by this issue. This predominantly consisted of hosted handset customers. Customers with public IP addresses on their SIP endpoint, including IADs, Hosted vPBX, or BYO-PBX with a public IP address, were not impacted.
Customers who were impacted experienced one-way audio, and were unable to hear the remote party on the phone call.
Root Cause Analysis
After gathering large amounts of debug data and sending it to our software vendor, alongside numerous retests of the situation in our development and testing environment, we were able to deduce the following:
Data102's SIP1 customer-facing proxy was configured for "nat=0" rather than "nat=1". This caused the media handler to disregard whether SIP packets were NAT'd and to trust the supplied SDP; as a result, RTP/media was sent to the private IP address of the handset rather than to the public, NAT'd IP. These private IP addresses are unroutable and unreachable, and thus no audio reached the handset.
The aberration in the NAT configuration does not present as an issue on the current 3.0x software platform due to a bug which disregards the NAT setting and always forces "nat=1", regardless of the setting or the reality. This bug is fixed in the 4.x software, so media is handled 'correctly' as configured, without NAT workarounds, which produced the one-way audio.
This situation was not detected in our extensive development and testing routines because the default setting is "nat=1", and this configuration setting is not visible in the web UI; it is only accessible via the database. As such, it would have been impossible to see this situation on a fresh dev/test install: it can only exist on an upgrade, where such a pre-existing broken configuration would be unearthed, "fixed", and result in a service impact.
Data102 resolved the immediate service impact by rolling back to the snapshots taken prior to the software upgrade on 2016-07-12. This immediately restored all functionality.
Data102 completed very extensive testing on the development environment, with vendor involvement, to confirm exactly where the configuration problem lies, how the bug impacted such a configuration, and how to fix it. Data102 is able to reproduce both the behavior and the solution on demand, and is confident that the configuration fix is viable. Data102 plans to re-initiate the 4.x upgrade at a later date.
- Date - 07/11/2016 08:00 - 07/11/2016 14:22
- Last Updated - 07/13/2016 14:17
- Priority - Critical
- Affecting Other - SAN02-COS
Service: CDP Backups
Yesterday at approximately 3:05 PM MST, Data102's Colorado Springs-side SAN02 lost two hard drives during a patrol read, destroying a 6-drive RAID5 set. At approximately 5PM, we completed swapping hard drives and began background-initializing the array, with an estimated completion time of 48 hours.
This morning at ~9:15AM, we began the resync process - copying the data from our Denver-side SAN01 to SAN02 to restore all services. This process has an estimated completion of ~33 hours.
In the meantime, to ensure data consistency, all CDP services have been temporarily disabled.
01/01/16 @ 12:45pm: The initialization and copyback are proceeding, at 37% and 8.5% respectively.
01/02/16 @ 08:30am: Processes continue, at 57% and 59% respectively
01/02/16 @ 11:50pm: Rebuild and data transfer complete
01/03/16 @ 08:15am: All backup jobs started
01/03/16 @ 11:30am: Backup jobs audited, repaired as needed
01/04/16 @ 09:28am: All backup jobs appear to have completed as intended
Our apologies for this significant inconvenience. We understand that backups are important, and we are working as diligently and as swiftly as possible to restore service without losing any data. Should the need arise to take or restore a backup, we MAY be able to facilitate your request, depending on the location of your data. Please open a trouble ticket with your request and we will respond promptly.
Data102 Operations Team
VP/Ops Randal Kohutek
888-328-2102 / 719-387-0000
- Date - 01/01/2016 09:30 - 01/04/2016 09:30
- Last Updated - 02/05/2016 12:40