We are currently addressing a processing delay in the ExchangeDefender antivirus scanning engine. One of our virus engine vendors distributed a faulty update, which has caused a backlog of messages that were quarantined for further inspection.
In this case the corrupt message was passed on to ExchangeDefender, which quarantined messages for further scanning, a far more expensive and processor-intensive step. We responded immediately and removed the engine. However, even slight issues can cause huge problems when you process as much mail as we do, and this one has introduced a slight delay in the processing of messages. The issue started at roughly 3:10 and was resolved by 3:40. At the time of this message, around 60% of our nodes are processing messages within our ordinary SLA (seconds), and we expect the rest of the network to catch up shortly. If you experience any delays, even extensive ones, they are due to the above problem, which will be completely under control within 30 minutes.
We are conducting routine maintenance on the Windows portion of our Virtual Hosting and Web Hosting network. We are applying patches, installing new hardware and performing general system maintenance tasks. All systems should be affected by a brief outage and will be back within an hour at most. Update: Maintenance cycle completed.
We are currently addressing an issue with Exchange 2007 OWA. You will see the following text when attempting to log in:
We will have this addressed momentarily and update this site. Update: 2:45: Problem has been solved.
We are conducting maintenance on our Offsite Backups architecture. 08:00–14:00 EST is our slowest time of the day and we’ll have the systems back online in time for the nightly backups.
Earlier today we had to flush the queues on the ExchangeDefender outbound server due to the large number of corrupt queue files sent by one of our customers' malfunctioning servers. If your messages were not delivered during the window between 5 AM and 7 AM Central (GMT -6), please resend them. The problem has been solved temporarily, but we will be holding an urgent maintenance window this Wednesday, 5/14, to address the core of the problem. P.S. A significant number of servers were backlogged during this process. That mail has been processed without issue.
We are currently tracking accessibility problems on mail1.ownwebnow.com. Please stand by while we research the issue. The server appears to be up, but several customers are reporting access issues and we are trying to resolve them right now. We will update this site as soon as we have more information. The cluster is currently undergoing a reboot. Update: All issues have been resolved. A scheduler service had hung on the load balancer.
We are currently investigating a network event in our Dallas region network centers. Since approximately 5 AM EST we have been receiving complaints about network connectivity and availability. There are currently no outages and there have been no outages, but certain customers are unable to reach the services on our network. If you are experiencing an issue, please open a support request and include a traceroute to the service you are trying to reach (ex: 65.99.255.50). We will update this ticket as soon as we have further information. Update (9:05 AM EST, 14:05 GMT): We believe the network issue some of our customers have experienced has been resolved. Particularly affected were some of our UK customers (not BT) and local customers with Level 3 connectivity. Although we see the traffic back up at usual levels, it may take about an hour until all the mail catches up and gets delivered.
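For anyone gathering the traceroute requested above, the following is a minimal sketch of how the command could be assembled and captured for pasting into a support request. The function names are illustrative only and are not part of any tool we provide. It assumes the standard `tracert` utility on Windows and `traceroute` elsewhere, and the address is the example from this post.

```python
import platform
import subprocess


def traceroute_command(host, system=None):
    """Build the traceroute command line for the given operating system.

    `system` defaults to the local platform name; pass "Windows" or
    "Linux" explicitly to see what would be run elsewhere.
    """
    system = system or platform.system()
    if system == "Windows":
        # -d skips reverse DNS lookups so the trace finishes faster
        return ["tracert", "-d", host]
    # -n prints numeric addresses only (no reverse DNS lookups)
    return ["traceroute", "-n", host]


def run_traceroute(host):
    """Run the trace and return its text output for the support request."""
    result = subprocess.run(
        traceroute_command(host), capture_output=True, text=True
    )
    return result.stdout


# Example (not executed here): print(run_traceroute("65.99.255.50"))
```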
We are currently working with RoadRunner (formerly Time Warner, AOL), a service provider in the United States. They are experiencing issues with their SMTP servers, which are randomly rejecting SMTP traffic. Currently mail is flowing through, but some is bouncing back from them for a reason they are still trying to narrow down. We will update when we have further information or a resolution. This issue affects our entire global network, as well as some external sites we have tested. Update: 6:34 PM EST: Even though we have not been officially updated, the problems with RoadRunner appear to have been resolved.
We will be starting the SP1 upgrade on our systems in roughly two hours (10 AM EST, -5 GMT) and expect to have all operations completed by noon. To minimize surprises and potential conflicts, the entire cluster and all its members will be patched at once. That unfortunately does mean a brief total outage, but it does minimize the chance of anything breaking in the process. Hosted Exchange has been one of our most solid products and we look forward to keeping it that way. Update: Exchange 2007 Service Pack 1 is now up and running. Systems have been imaged and we are all done.
Earlier today we identified a major bug in the system used to generate SPAM statistics for the daily and intraday email reports of some users. Although the issue affected only a few thousand people, I have chosen to pull it out of the production systems to avoid further confusion and loss of email report integrity. As soon as the bug fix is tested thoroughly, we will be placing it back into production. In the meantime, you will no longer see the “Non SPAM Mail” total under statistics.
Problem Details
ExchangeDefender daily and intraday reports are built using SQL queries against the mail log database. There are three queries executed for each report: one to obtain the SPAM messages, one to obtain the SureSPAM messages and one to obtain the total number of rows in the table, which covers SPAM, SureSPAM and messages let through. Each SPAM query is executed within a check that verifies whether the user settings are to store/quarantine junk mail, because otherwise there is nothing to report: the messages are delivered and/or deleted. Totals for SPAM and SureSPAM are calculated within the respective settings check blocks. For example:
The problem with the Non SPAM count arose when the user did not store/quarantine their SPAM or SureSPAM, which meant the blocks of code that calculate the totals for that group were never executed. The Non SPAM total would not get the correct amount of SPAM or SureSPAM subtracted from it, and it would appear to the user as if they were missing messages, because they surely were not receiving the number of messages the report indicated.
Stupidity Details
We figured we could save a few cycles by not running an extra query and total if the users did not store/quarantine SPAM or SureSPAM. Unfortunately, the equation for Non SPAM did not take that check into account. Instead of subtracting the correct totals for SPAM and SureSPAM, which are still logged but never reported, we were subtracting zero, thereby inflating the Non SPAM total for certain users. The good news is that it was simple enough to fix. Sorry for all the frustration that has come out of this, as my support staff, my partners and my clients were all seeing different results across the network. Considering I am responsible for the above, I apologize for all the problems this has caused for you.
Vlad Mazek
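The counting error described in this post can be illustrated with a short sketch. This is not the production report code: the class and function names are hypothetical, and plain integer counts stand in for the results of the three SQL queries against the mail log database.

```python
from dataclasses import dataclass


@dataclass
class MailLogCounts:
    """Stand-ins for the three query results (hypothetical names)."""
    total_rows: int  # every logged message: SPAM, SureSPAM and let through
    spam: int
    sure_spam: int


@dataclass
class UserSettings:
    quarantine_spam: bool
    quarantine_sure_spam: bool


def buggy_non_spam(counts, settings):
    """Totals are only filled in inside the settings checks."""
    spam_total = 0
    sure_spam_total = 0
    if settings.quarantine_spam:
        spam_total = counts.spam
    if settings.quarantine_sure_spam:
        sure_spam_total = counts.sure_spam
    # When a check is skipped, zero is subtracted and the result is inflated.
    return counts.total_rows - spam_total - sure_spam_total


def fixed_non_spam(counts, settings):
    """Always subtract the logged totals, whatever the user reports on."""
    return counts.total_rows - counts.spam - counts.sure_spam


counts = MailLogCounts(total_rows=1000, spam=300, sure_spam=500)
no_quarantine = UserSettings(quarantine_spam=False, quarantine_sure_spam=False)
print(buggy_non_spam(counts, no_quarantine))  # 1000, inflated
print(fixed_non_spam(counts, no_quarantine))  # 200
```

In this sketch the two functions agree for a user who quarantines both categories, which matches the post's observation that only some users saw wrong totals.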
Over the past six months, as the volume of SPAM has increased nearly exponentially, we have been seeing more and more large mail servers fail and start rejecting mail outright. Here is an update from 1&1:
We have been working on a new infrastructure upgrade to address some of these misconfigurations, some popular, others not so much. We have directly investigated hundreds of “why didn’t my mail get there?” support tickets, and in all but three (and orange.co.uk) the mail got to the recipient's mail server without problem. As a result, we throttled down our notifications so that users receive an alert within 3 hours and within 1 day of deliveries being delayed, deferred, rejected or dropped. That gives them a way to contact the person directly if the communication is urgent but the mail systems are not working as they should. Later this week we will put into production split mail relays over multiple networks that will implement the same intelligent routing technology we use in our inbound servers. The paperwork to get the agreements with larger ISPs (AOL, Yahoo) takes a little while, but we are confident it will get done this week. Some system changes will be required if you use SPF records, and they will be noted here as we get closer to putting those into production.
In about 10 minutes (11 CST, -6 GMT) we will be shutting down the Offsite Backup infrastructure for the memory and networking upgrades to accommodate the new global replication platforms and proxies. The maintenance interval is expected to last less than an hour and we do not expect anything out of the ordinary.
Earlier this morning Yahoo and Yahoo UK & Ireland started experiencing problems with their RBL code. As a result, a large number of messages from our customer base to theirs have been rejected. Users whose email bounced would have seen returns in their inboxes. Please ask them to resend the messages. Mail is flowing without issue at the moment, but we are still working with Yahoo and the case remains open. We will update this site when the issue has been resolved completely.
We are currently performing some investigative maintenance on backup73.ownwebnow.com to ensure data integrity and troubleshoot some failed logins by our clients. Please stand by; we will update with the resolution time momentarily. Update: Service restored.
Our Los Angeles data center carrier has suffered an HVAC failure, and the connectivity to the network has been severed for the time being. The facilities team is in touch with the building owner, and service restoration is under way. All services provided by this data center are unfortunately affected and down at the moment. Services affected: some ExchangeDefender, some SharePoint Hosting, some Virtual Servers. We will update this ticket when all services have been restored. This ticket is ranked urgent. Our priority will be to restore services that are not redundant first: virtual servers, followed by SharePoint hosting. Update (@ 3:00 AM PST -8 GMT, 6 AM EST -5 GMT): We expect SharePoint and Virtual Server services to be restored around 6 AM PST (-8 GMT). ExchangeDefender services are not impacted (please be patient with SPAM releases, however). We will update this ticket at 6 AM or when services start coming back online. Update (@ 3:44 AM PST -8 GMT, 6:44 AM EST -5 GMT): All services have been restored. Total LA DC1 outage: 53 minutes.
At roughly 5 AM EST (GMT -5) our primary backup proxy server in Dallas, TX went down for basic hardware maintenance. Upon restart, the primary RAID array controller lost its boot configuration and the system hung after all the drives were initialized. As you may imagine, our backup infrastructure is huge, and a restart can take 30-40 minutes to spin up all the drives over all controllers. It takes a while to determine the issue when all hardware reports correctly, but the case was escalated and addressed right away. We are sorry about the inconvenience. Note: Our storage infrastructure does not follow the same maintenance interval as the remainder of our network. While almost all of our services have the least usage during early Saturday morning hours (EST), offsite backups tend to have the heaviest usage during those hours. Large backup sets are usually scheduled to start Friday afternoon after most 9-5 workers leave, and they generally run through the weekend. Likewise, we do global network snapshots over the weekend right before major maintenance tasks on the network. For this reason, our maintenance window for offsite backups is pushed up one day.
Generally this blog is used to document any and all activity related to something that is broken. Since things have been working quite well lately and we’re mostly on new projects and performance related tasks I wanted to update you on the status of the entire network.
ExchangeDefender - Solid. No issues, no latencies, no delays and overall flawless performance. Over the past week we have not had any delay reports that turned out to be on this end, we have had no DDoS issues to report, we have our lowest latency ever (between 5-10 seconds, network-to-network) and fewest false positives in a while.
Hosting - Solid. No issues, working on Exchange 2007 SP1 upgrades and 2003-2007 migrations as well as WSS2 to WSS3 migrations. We are working on integrating our new global points of presence in Australia and Europe and they are coming together well.
Offsite backups - Solid. No major issues, one patch and hardware enhancement scheduled for this weekend to help with large file backups.
VoIP - We are adding another IAX2 provider to our network, should become available shortly.
Training - Mostly scenario and interface tests for ExchangeDefender 4.0. Bringing on new materials, including Shockey Monkey 2.0. Web site undergoing a major update.
That is all. It’s so boring it already feels like the summer.
The ExchangeDefender 4.0 engine has been online for a few days now, and as of Monday/Tuesday night (USA time) we have addressed all the outstanding issues regarding latency, non-delivery, rejections and the garden variety of performance problems. Today we had a relatively flawless day, with the highest SPAM detection rate ever and the lowest false positive count ever, just the best thing we could have hoped for given the very smooth upgrade. With that in mind, we want to beg you to open a support ticket if you notice anything, even minor, with the performance and reliability of ExchangeDefender over the last 24 hours at most (don’t bother going back further than that, as we cannot address problems that have very likely already been fixed). So here is to a great ExchangeDefender 4.0 engine. If you see anything unusual, please open a support request and we will investigate it with the highest priority, free of charge. We absolutely appreciate your help and the time you put in to help us improve the product.
From approximately 10 AM to 2 PM EST, our shared mail hosting platform (mail1.ownwebnow.com) suffered a large-scale distributed denial of service (DDoS) attack. Everything is under full control now and we have been able to filter out the offending systems. Unfortunately, there is little more that can be done in terms of scale and protection against a DDoS, as we already have both Cisco and Tipping Point in place. DDoS attacks tend to be made worse by regular user activity: as the system slows down, end users keep clicking Send/Receive and effectively flood the connection until it times out. Systems are back to normal and messages are starting to arrive in regular sequence.
We are currently in the process of bringing the next generation of reporting infrastructure to ExchangeDefender. The new grid is currently being activated, and the work will be completed within two hours. This work is meant to create a more reliable way to deliver email reports. We will update as soon as everything is back to 100%. Update: (8:00 GMT, March 5th, 2008): Maintenance window completed after 53 minutes, 8 seconds. All systems are back to normal and are catching up. We expect the transaction latency to get back to realtime within the next 20 minutes. An update on the resumption of daily and intraday reports is coming shortly. Update: (9:30 GMT, March 5th, 2008): Currently working on restoring intraday, daily and on-demand reporting features. We expect the work to be completed by noon, EST. Update: (15:00 GMT, 10 AM EST, March 5th, 2008): Maintenance completed. Reports (daily and intraday) will resume within a few minutes (at 10 AM central) and will be faster than before. This also opens up a whole new range of reporting options within ExchangeDefender which will launch with ExchangeDefender 4.x. We will also make another offering available immediately to address those of you who have a critical need for ExchangeDefender reports.

