/* Partykof: July 2010 - Managing information and Technology */
In this blog, I summarize some of my work so far and the issues I face every day as an IT professional.
You are welcome to follow, comment and share with others. If you want to drop me a private note, send me an e-mail.


Friday, July 30, 2010

Troubleshooting problems in Linux, based on a sample of the DD-WRT web GUI not responding

In this post I am going to present a sample troubleshooting procedure for a Linux box whose web interface suddenly stops responding after a few weeks of normal operation. I will present the use of basic tools that are usually built into any Linux box, and an external monitoring tool based on MRTG.

I use a Linksys WRT54GS wireless router running the DD-WRT v24-sp1 mega firmware. It is a small appliance based on the Broadcom BCM4712 chip, running a scaled-down Linux OS. Since I installed this version I have noticed that once in a while I am unable to access the web interface of the router. The simplest solution was to power cycle the router by pulling its power plug, but that meant getting to my router, which sits somewhere in the attic. I decided to try and figure out what was causing the problem.

First, I configured SSH access to the router, so I would be able to remotely connect to it, and reboot it in case I needed to. I also configured SNMP monitoring for it, to collect statistics of its performance.
Once the problem reoccurred, I was able to connect to the router and run a simple top command to see which processes were running and whether that could help me figure out the problem.

Figure 1: Console view of top output

I immediately noticed that the router load was high, and that the process causing it was the web server daemon, httpd, which was consuming 98.2% of the CPU.
Wondering when the problem started, I turned to the RRD graph and noticed that it had been going on for more than 3 weeks, since the beginning of week 28.


Figure 2: Weekly view of router CPU load

In Figure 2, you can clearly see that the router load rose sharply above the load value of 1, which means that the CPU was working at 100% and was queuing processes, which in turn means performance degradation.
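The reading of the load graph can be expressed as a small check. This is only a minimal sketch, assuming a Linux system that exposes /proc/loadavg and has awk available; the helper name load_state and its CPU-count argument are illustrative and not part of DD-WRT.

```shell
#!/bin/sh
# load_state LOAD NCPU: print "saturated" when the load average exceeds
# the number of CPUs, otherwise "ok". (Hypothetical helper name.)
load_state() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (load + 0 > cpus + 0) print "saturated"; else print "ok"
    }'
}

# Usage: classify the current 1-minute load on a single-CPU box.
if [ -r /proc/loadavg ]; then
    read -r load1 _rest < /proc/loadavg
    load_state "$load1" 1
fi
```

On a box with more than one CPU, pass the CPU count as the second argument, since a load of 1 only means full utilization when there is a single CPU.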
I tried correlating the problem with a memory or traffic incident at the time the problem started. Figure 3 shows the memory utilization of the router and Figure 4 shows inbound and outbound traffic on the router WAN bridge.


Figure 3: Weekly view of router memory usage
 

Figure 4: Weekly view of traffic on WAN interface

Looking at the beginning of week 28 in both graphs, I found nothing unusual at the time the problem started, and no sign that either of these parameters caused it.

Another factor that might have an effect is the system's disk capacity, but in such a small router the whole file system is always reported as 100% full, so this is not an indication of a problem.

Having found only the symptom and not the cause of the problem, I googled it, and guess what, it is a known issue. According to others in the DD-WRT community, the problem is caused by intensive use of P2P services, and currently there is no resolution for it other than using the Mini firmware version.
Since I need the Mega firmware version for VPN and VOIP, I cannot afford to downgrade my router, so the best option is to live with it. To make life easier, I wrote a small script that I can run remotely to restart the web service, without even having to log in to the router interactively.
   #!/bin/sh
   # Restart the DD-WRT web server daemon (httpd)
   stopservice httpd
   startservice httpd



You can view a nice reference for doing this procedure in this Link
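
One way to run the same two commands remotely in a single non-interactive call is sketched below. It assumes SSH key authentication is already set up on the router; the function name restart_httpd_remote, the SSH_CMD override and the address root@192.168.1.1 are illustrative assumptions, not something taken from DD-WRT itself.

```shell
#!/bin/sh
# restart_httpd_remote TARGET: run the two DD-WRT service commands on the
# router through one non-interactive SSH call. SSH_CMD can be overridden
# (for example with "echo") to preview the call without connecting.
restart_httpd_remote() {
    "${SSH_CMD:-ssh}" "$1" 'stopservice httpd; startservice httpd'
}

# Usage (root@192.168.1.1 is an assumed address, adjust to your router):
#   restart_httpd_remote root@192.168.1.1
```

Keeping the remote commands in a single quoted string means they are interpreted by the router's shell rather than by the local one.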

In summary, although this is only a small Linux box, or a router, the basic procedure for identifying a problem or its symptoms is the same: look at the system in normal operation and compare any irregularity to that steady state. Using MRTG tools to collect statistics for reference is very important and useful for troubleshooting and capacity planning.

-Nir

IBM acquires Storwize, A real-time in-line lossless data compression

The storage magazines are all reporting that IBM announced today that it has decided to acquire Storwize, which provides real-time data compression technology.

About Storwize
Storwize, headquartered in Marlborough, MA, with an R&D office in Israel, provides online storage optimization through real-time data compression. Storwize's Random Access Compression Engine™ (RACE), applied in its STN appliances, transparently (in-line) compresses primary storage by up to 80 percent. They promise random access and deterministic, lossless data compression with no reduction in performance.

Key Values
The value of the Storwize solution rests on three points.
  1. It is based on existing industry LZ compression algorithms, such as the one used in standard tape backup operation, but its revolutionary idea is that it compresses in real time with no data loss.
  2. It is very simple to deploy; it is a plug-and-play solution that is seamless to day-to-day operation and is installed in less than 30 minutes. Compression can begin immediately for new data; old data is compressed seamlessly over time.
  3. It presents immediate ROI: it allows significant savings from day one and enables bigger operational capacity in storage and performance with the current investment.
 Figure 1: Typical Storwize solution

Advantages with current storage investment 
An implementation of the Storwize solution will provide the following benefits:
  • Compress the data on existing network storage systems and save on the next disk purchases.
  • Compress data going into the storage systems, which extends the performance capacity of the current systems for a longer period and delays the acquisition of new ones.
  • Give an immediate boost to user experience as a result of the reduced load on the storage system.
  • Reduce the need for users to sort, delete or compress their files to stay within their current quota, freeing them from tedious tasks so they can focus on real work.
  • Reduce recovery time from tapes dramatically, as less data is transferred from the tapes to the disks.
  • Power saving (green computing) - when using fewer disks to store compressed data, you save the power of the disk shelves that were needed for the uncompressed capacity.
  • Smaller footprint (floor space savings) - when using fewer disk shelves, you delay the need to expand the expensive data center floor space.

Risks to consider 
Like any new solution integrated into your computing environment, and being a relatively new technology on the market, it presents several risks that must be addressed, or that you should at least be aware of.
  • The appliance is placed in-line between your network switch and the storage system, which means it is another failure point in the critical path of your environment. - Precaution: Make sure you deploy Storwize's fault-tolerant solution to avoid a single point of failure.
  • An overlooked compression configuration could result in data loss. - Precaution: Set configuration control procedures and change management to avoid faults.
  • Introduction of a totally new system with no prior experience. - Precaution: Seriously consider holding training sessions for IT personnel who will manage this environment.
  • Scale-out is limited when using the Storwize solution with new NAS technologies: there is no support for a global/shared namespace. - Precaution: Consider deploying this solution on isolated controllers, at least until Storwize offers a solution for persistent namespaces.
Conclusion 
With over 18 months of experience working with this solution, I can say that the results were great. I noticed a very good compression ratio on typical data on the storage systems, along with performance improvements. Some configuration issues were discovered early in the deployment; however, they were immediately resolved by Storwize. This solution is indeed revolutionary in its concept and its results. It presents many advantages and some risks, which should be addressed as advised if this solution is to be considered.

-Nir

Tuesday, July 20, 2010

Configuring a server for optimal performance


The preceding posts have illustrated the major building blocks that affect server configuration; I explained the importance of each one and its priority when adding it to the system.
If you missed them you can check these links:
In this final post of server configurations, I will present examples of configurations and areas where they should be applied.

Major Configurations
The configuration of a server is derived from the requirements of its target application. There are four major configurations:

  1. Maximum Performance 
  2. Balanced Performance 
  3. Maximum Capacity 
  4. RAS configurations

Maximum Performance 
    This configuration is intended to get the maximum CPU frequency and the maximum memory bandwidth. It usually uses a low DIMM count, as you populate only one DIMM per channel (i.e., 6 DIMMs overall). Such servers are commonly used for High Performance Computing (HPC) in research organizations, the oil and gas industry, and chip design.
 Figure 1:  Maximum Performance

Best configuration at the time of publishing this post:
  • CPU - Intel Xeon X5680 (3.33GHz), 6 cores per processor.
  • Memory - 6 PC3-10600 DIMMs (such as Kingston KVR1333D3D4R9SK3/24G) to allow 48GB of RAM, at 10.6GB/s peak bandwidth per memory channel.

  Balanced Performance 
    This configuration strikes a balance between the maximum CPU frequency and the maximum memory capacity. It usually uses a medium amount of memory, up to 96GB per host. Such servers are commonly used for virtualization and other standard enterprise applications.
 Figure 2:  Balanced Performance

Best configuration at the time of publishing this post:
  • CPU - Intel Xeon X5680 (3.33GHz), 6 cores per processor.
  • Memory - 2 DIMMs per channel (DPC), 12 PC3-8500 DIMMs (such as Kingston KVR1066D3Q8R7SK3/24G) to allow 96GB of RAM, at 8.5GB/s bandwidth to memory.

  Maximum Capacity
    This configuration is focused on supporting the maximum memory capacity with considerable compute power. It is usually designed to use as much as 144GB of RAM per host (288GB with the upcoming 16GB modules, i.e. 18 x 16GB). Such servers are commonly used as very large scale database servers.
 Figure 3:  Maximum Capacity

Best configuration at the time of publishing this post:
  • CPU - Intel Xeon X5680 (3.33GHz), 6 cores per processor.
  • Memory - 3 DPC, 18 PC3-8500 DIMMs (such as Kingston KVR1066D3Q8R7SK3/24G) to allow 144GB of RAM, at 6.4GB/s bandwidth to memory, because three DIMMs per channel run at DDR3-800.
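
The capacity and bandwidth figures quoted for the three configurations can be reproduced with simple arithmetic. The sketch below assumes 8GB DIMMs (48GB / 6 DIMMs) and a 64-bit (8-byte) DDR3 channel, so the peak rate of one channel is the transfer rate in MT/s times 8 bytes. The DDR3-800 speed for the 18-DIMM case is inferred from the 6.4GB/s figure; the function names are illustrative.

```shell
#!/bin/sh
# total_gb DIMMS GB_PER_DIMM: installed memory in GB.
total_gb() {
    echo $(( $1 * $2 ))
}

# peak_mb_s MT_PER_S: theoretical peak of one 64-bit (8-byte) DDR3 channel, in MB/s.
peak_mb_s() {
    echo $(( $1 * 8 ))
}

# DIMM count, DIMM size in GB, and memory speed for the three configurations.
for cfg in "6 8 1333" "12 8 1066" "18 8 800"; do
    set -- $cfg
    echo "$1 x ${2}GB = $(total_gb "$1" "$2")GB, DDR3-$3 = $(peak_mb_s "$3") MB/s per channel"
done
```

The results line up with the module names: 1333 x 8 is roughly 10600 MB/s (PC3-10600) and 1066 x 8 is roughly 8500 MB/s (PC3-8500).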

 RAS Configuration
    RAS stands for Reliability, Availability and Serviceability.  Although the ECC technology offers error correction, it does not provide any failover capability. Replacing a DIMM in case of failure requires a power down of the system. The RAS configurations offer three memory protection options:
    1. Online spare memory mode
    2. Mirrored memory mode
    3. Lockstep memory mode
       
    This configuration uses only two out of the three channels.

     Figure 4:  RAS configuration

       Online spare memory mode
        In this mode, one of the channels is designated as a spare. This channel is not used in normal system operation. If a working DIMM exceeds the threshold of correctable memory errors, the system switches to the standby channel and the faulty channel is taken offline.
         
         Mirrored memory mode
        In this mode, the same data is written to each channel and reads alternate between the two channels. If a working DIMM in one of the channels exceeds the threshold of correctable memory errors, the faulty channel is taken offline and the system switches to using only one channel.
         

         Lockstep memory mode
        This mode uses two memory channels at a time, and they work as a single channel. Each read and write operation moves a data word two channels wide, to provide double 8-bit error correction within a single DRAM. This mode is the most reliable, but it reduces the maximum memory capacity as the third channel is not used.

      Summary
      By now you should have the tools to configure your server for the optimal performance you will need for your application. You should focus on the application's memory requirements and start from that point to configure how much memory you should use and in which configuration of ranking and population.

      -Nir

        Monday, July 19, 2010

        Populating DIMMs considerations, Order and Ranks

        The Nehalem and Westmere platforms offer a wide variety of DIMM configurations. Some of them are shown below.

        Feature                                  Values
        Number of DIMMs                          1, 2 or 3
        Number of DIMM slots per channel         2 or 3 DIMM slots
        Number of DIMMs populated per channel    1, 2 or 3 DIMMs per channel
        DIMM frequencies                         DDR3-800, DDR3-1066, DDR3-1333
         Table 1:  DIMM Configurations

        Populating DIMMs within a channel

        When populating DIMMs in a three-slots-per-channel configuration, a "fill-farthest" approach is used, meaning that the slot farthest from the processor is populated first. If a quad-rank DIMM is used, it should be populated first.
        Figure 1:  DIMM Population within a channel

        DIMM population in an 18 DIMM slots configuration

                     CPU1                              CPU2
        Channel      Slot Number   Population Order    Slot Number   Population Order
        Channel1     1             G                   1             G
                     2             D                   2             D
                     3             A                   3             A
        Channel2     4             H                   4             H
                     5             E                   5             E
                     6             B                   6             B
        Channel3     7             I                   7             I
                     8             F                   8             F
                     9             C                   9             C
         Table 2:  DIMM Population in 18 DIMM Slots
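
The population order in Table 2 can be turned into a small lookup. This is a sketch only: slots_for is a hypothetical helper name, and it simply lists the slot numbers in the order A to I, applied per CPU.

```shell
#!/bin/sh
# slots_for N: print the slot numbers to populate, per CPU, for N DIMMs,
# following population order A..I from Table 2.
slots_for() {
    n=$1
    # Slots listed in population order: A B C D E F G H I
    set -- 3 6 9 2 5 8 1 4 7
    if [ "$n" -lt 1 ] || [ "$n" -gt $# ]; then
        echo "expected 1 to $# DIMMs per CPU" >&2
        return 1
    fi
    out=""
    while [ "$n" -gt 0 ]; do
        out="${out:+$out }$1"
        shift
        n=$((n - 1))
    done
    echo "$out"
}

# Usage: six DIMMs per CPU fill slots 3, 6, 9 first, then 2, 5, 8.
slots_for 6
```

Filling A, B and C first puts one DIMM on each of the three channels before any channel gets a second DIMM.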

        Additional population requirements
        1. All DIMMs must be DDR3 DIMMs.
        2. The 5600 series supports low-voltage DDR3 memory (DDR3L) at 1.35V, while the 5500 series supports only 1.5V. If the two voltages are mixed, the DIMMs will work at 1.5V.
        3. Mixing Registered and Unbuffered DIMMs is not allowed.
        4. The maximum supported speed is defined by the BIOS and not by the DIMMs.
        5. Mixing DIMMs with different timings will force operation at the speed of the slowest DIMM for both processors.

        RDIMM ranks population in a three-slots-per-channel configuration

        Configuration Number   Max Speed    DIMM2         DIMM1         DIMM0
        1                      DDR3-1333    -             -             Single-rank
        2                      DDR3-1333    -             -             Dual-rank
        3                      DDR3-1066    -             -             Quad-rank
        4                      DDR3-1066    -             Single-rank   Single-rank
        5                      DDR3-1066    -             Single-rank   Dual-rank
        6                      DDR3-1066    -             Dual-rank     Single-rank
        7                      DDR3-1066    -             Dual-rank     Dual-rank
        8                      DDR3-800     -             Single-rank   Quad-rank
        9                      DDR3-800     -             Dual-rank     Quad-rank
        10                     DDR3-800     -             Quad-rank     Quad-rank
        11                     DDR3-800     Single-rank   Single-rank   Single-rank
        12                     DDR3-800     Single-rank   Single-rank   Dual-rank
        13                     DDR3-800     Single-rank   Dual-rank     Single-rank
        14                     DDR3-800     Dual-rank     Single-rank   Single-rank
        15                     DDR3-800     Single-rank   Dual-rank     Dual-rank
        16                     DDR3-800     Dual-rank     Single-rank   Dual-rank
        17                     DDR3-800     Dual-rank     Dual-rank     Single-rank
        18                     DDR3-800     Dual-rank     Dual-rank     Dual-rank
         Table 3:  DIMM RANKS Population in 3 slots per channel
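
The 18 rows of Table 3 collapse into a few patterns, which the sketch below encodes as a lookup. The function name max_speed and the one-letter codes (S, D, Q for single-, dual- and quad-rank, and - for an empty slot) are illustrative; any combination that does not appear in the table is reported as not listed rather than guessed.

```shell
#!/bin/sh
# max_speed DIMM2 DIMM1 DIMM0: each argument is S (single-rank), D (dual-rank),
# Q (quad-rank) or - (empty slot), in the column order of Table 3.
max_speed() {
    case "$1,$2,$3" in
        -,-,[SD])        echo "DDR3-1333" ;;  # rows 1-2: one single- or dual-rank DIMM
        -,-,Q)           echo "DDR3-1066" ;;  # row 3: one quad-rank DIMM
        -,[SD],[SD])     echo "DDR3-1066" ;;  # rows 4-7: two DIMMs, no quad-rank
        -,[SDQ],Q)       echo "DDR3-800" ;;   # rows 8-10: two DIMMs, quad-rank in DIMM0
        [SD],[SD],[SD])  echo "DDR3-800" ;;   # rows 11-18: three DIMMs
        *)               echo "not listed in Table 3" >&2; return 1 ;;
    esac
}

# Usage: two dual-rank DIMMs in DIMM1 and DIMM0 (configuration 7).
max_speed - D D
```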
        This concludes all the basic elements we need for configuring the perfect server.

        -Nir