IBM WebSphere Application Server 7 Tuning Guide

The core file ulimit and maximum file size ulimit are independent mechanisms that limit the maximum size of a core dump, since a core dump is just a regular file. The default values are restrictive for several reasons. If you increase the core and maximum file size ulimits and you have insufficient disk space, then your application may be affected and the core dump itself may be truncated. For example, the default working directory of a WAS process is the profile directory, and a core file will be written to the default working directory. This directory holds many artifacts such as configuration, transaction information, and logs, so if core files fill up this disk, then application updates and other functionality may fail.

Therefore, we recommend creating a dedicated filesystem for system dumps. This filesystem should be on a fast, local disk or even a RAM disk. If you decide to change the working directory instead, be aware that this also changes where javacores and other artifacts go, so document this change for your administrators' awareness. We recommend unlimited core file and maximum file size ulimits because we cannot know ahead of time how large the virtual address space will become. For example, if you have a native memory leak, the virtual address space can become very large before a problem occurs, and determining the problem from a truncated core dump will be very difficult or impossible.

You shouldn't wait until you have a problem: we recommend setting proper ulimits ahead of time for any WAS installation.

  • IBM WebSphere Application Server Performance Cookbook - Home.
  • IBM WebSphere Application Server Performance Cookbook.

In general, the easiest way is to update the relevant global ulimit configuration file (on Linux, for example, /etc/security/limits.conf). Then restart all WAS processes; if you were logged in to a shell from which you restart such processes, you will need to log out and log back in. Alternatively, you can set these ulimits in the shell that spawns the process, before the Java command.
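A minimal sketch of both approaches, assuming Linux and an illustrative "wasadmin" user (the user name and file paths are assumptions, not from the original):

```shell
# Persistent approach: add lines like these to /etc/security/limits.conf
# ("wasadmin" is an illustrative user running WAS):
#   wasadmin  soft  core   unlimited
#   wasadmin  hard  core   unlimited
#   wasadmin  soft  fsize  unlimited
#   wasadmin  hard  fsize  unlimited

# Temporary approach: raise the ulimits in the shell that will spawn
# the JVM (may fail if the hard limit is lower than requested):
ulimit -c unlimited 2>/dev/null || echo "could not raise core ulimit"
ulimit -f unlimited 2>/dev/null || echo "could not raise file size ulimit"

# Verify the current soft limits:
ulimit -c
ulimit -f
```

Remember that the persistent change only takes effect for new login sessions, which is why the log-out/log-in step above matters.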

If you are starting servers from the Administrative Console or using wsadmin, then make sure that you set these ulimits for the node agent's startNode script. If you are manually starting servers, set them for startServer. While writing a core dump, the operating system completely pauses the process. After the core dump is finished, the process resumes where it left off.

The time it takes to write the core dump is mostly proportional to the virtual address space size, the available file cache in physical memory, and disk speed. The file cache is an area of physical memory (RAM) that is used as a write-behind or write-through cache for some virtual file system operations. If a file is created, written to, or read from, the operating system may try to perform some or all of these operations through physical memory and then flush any changes to disk.

The best way to improve the performance of writing system dumps is to ensure that the physical memory and file cache have at least as much free space as the virtual size of the process. This means the operating system will write the core to RAM, continue the process, and then asynchronously write it to disk. Finally, disk speed is a major factor in core dump writing performance for obvious reasons, and you can increase performance by dedicating a faster disk to core dump processing.

Recent IBM Java versions include a ulimit section in a javacore: you can take a javacore using "kill -3 PID" (or other means) and search for this section. On Solaris, you can use the "plimit" command to print the current ulimits for a running process. There is a valid philosophical question of whether to set every ulimit to unlimited for particular users, such as those running WAS (note that some values do not accept "unlimited", such as maximum open files on Linux, which instead takes a specific maximum value).
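As a sketch, you can request a javacore and grep for its ulimit section. The javacore excerpt below is an illustrative approximation of the IBM format, not authoritative output:

```shell
# Illustrative excerpt of a javacore ulimit section (approximate format):
cat > /tmp/javacore.sample.txt <<'EOF'
1CIUSERLIMITS  User Limits (in bytes except for NOFILE and NPROC)
2CIUSERLIMIT   RLIMIT_CORE     unlimited    unlimited
2CIUSERLIMIT   RLIMIT_FSIZE    unlimited    unlimited
2CIUSERLIMIT   RLIMIT_NOFILE   10000        10000
EOF

# Against a real JVM: "kill -3 <PID>" writes a javacore*.txt file into
# the process working directory; then search for the section:
grep -A 3 "User Limits" /tmp/javacore.sample.txt
```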

Changing ulimit values is not really a "tuning" exercise since these values are simply restrictions on what a process can do. It is true that some of these resources directly or indirectly use kernel memory which is a shared, limited resource; however, a well-behaved kernel should kill any process that exhausts its resources and the resulting core dump should have obvious symptoms of the offending resource usage or leak.

Until that point, it's not clear why potentially legitimate resource usage should be constricted by these arbitrary default ulimits, or by the equally arbitrary ulimits you may find in other tuning documents. Also review the general topics in the Operating Systems chapter. Check the system log for any warnings, errors, or repeated informational messages.

The location or mechanism depends on the distribution. Each of these pseudo files is documented in "man 5 proc". Notice that the user must have sufficient permissions; simply prepending sudo is not enough. The reason a simple "sudo echo" doesn't work is that it runs the echo command as root, but the output redirection occurs under the user's context.


Therefore, you must use something like the tee command. This works, but the change will be reverted on reboot. To persist a change, update /etc/sysctl.conf, which lists key=value pairs to be set on boot, separated by an equal sign. You can also use sysctl to temporarily update variables, similar to the echo approach above; applying the sysctl configuration file not only sets the currently running settings, but also ensures that the new settings are picked up on reboot. For commands such as ps, you can control which columns are printed and in which order using -o.
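The tee and sysctl approaches above can be sketched as follows (kernel.pid_max is an illustrative knob, and the commands mutate system state, so they require root):

```shell
# Temporary change: sudo affects only the echo, not the redirection,
# so pipe through tee instead of redirecting:
echo 65536 | sudo tee /proc/sys/kernel/pid_max

# Equivalent temporary change via sysctl:
sudo sysctl -w kernel.pid_max=65536

# Persistent change: add a key=value line to /etc/sysctl.conf, e.g.
#   kernel.pid_max = 65536
# then load the file immediately (it is also applied on boot):
sudo sysctl -p
```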

Without arguments, top periodically updates the screen with new information. In the case above, the Java process is using all the available CPU but is not contending with any other process; therefore, the limiting performance factor is the CPU available to the machine. Use the -b flag to run top in batch mode instead of redrawing the screen every iteration.

Use -d to control the delay between iterations and -n to control the number of iterations. The output of top -H on Linux shows the breakdown of the CPU usage on the machine by individual threads.

WebSphere Application Server V8.5.5 Performance Tuning

The top output has the following columns of interest:

  • PID: the thread ID. This can be converted into hexadecimal and used to correlate with the "native ID" in a javacore.
  • S: the state of the thread.
  • %CPU: the percentage of a single CPU used by the thread.
  • TIME: the amount of CPU time used by the thread.

In the example, the Java process is not limited by other processes.

You can also see that the CPU usage of the Java process is spread reasonably evenly over all of the threads in the Java process. This spread implies that no one thread has a particular problem. A report indicating that active processes are using a small percentage of CPU, even though the machine appears idle, means that the performance of the application is probably limited by points of contention or process delay, preventing the application from scaling to use all of the available CPU.

If a deadlock is present, the reported CPU usage for the Java process is low or zero. Where you have threads of interest, note the PID values because you can convert them to hexadecimal and look up the threads in the javacore. In this way you gain an understanding of the kind of work that the thread does from the thread stack trace in the javacore. For example, the decimal PID 31253 becomes 7A15 in hexadecimal. This value maps to the "native ID" value in the javacore.

You can convert the thread ID into hexadecimal and search for it as the native ID in a matching javacore. The following command may be used to periodically gather the top 50 threads' CPU usage for the entire machine. Note that this use of top -H may consume a significant amount of CPU because it must iterate over all threads in the system. If the load averages are greater than the number of CPU cores, then there may be cause for concern.
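For example, the decimal-to-hexadecimal conversion can be done with printf; the top invocation in the comment is an illustrative sketch (flags and iteration counts are assumptions):

```shell
# Convert a decimal thread ID from top -H into hex for javacore lookup:
printf '%X\n' 31253    # prints 7A15

# Illustrative periodic collection of the busiest threads machine-wide
# (batch mode, show threads, 30s delay, 10 iterations, keep top lines):
#   top -b -H -d 30 -n 10 | head -60 >> top_threads.log
```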

  • IBM WebSphere Application Server Performance Cookbook - Single Page.

If CPU utilization does not correlate with load averages, review the number of threads in the "D" uninterruptible state. You can report sar data textually on the system using the "sar" command.

    WebSphere Application Server Top 10 Tuning Recommendations

    You can also visualize sar log files using ksar, which is BSD licensed. One reason to use nmon on Linux is that the Java GUI nmon analyzer is a very powerful and flexible graphing application that accepts nmon data. Note that any errors starting nmon, such as file permission problems writing to the specified directory, will go to nohup.out. You can also run 'ps -elf | grep nmon' to make sure it started. Collectl is a comprehensive performance data collection utility similar to sar.

    It is fine grained with low overhead and holistically collects all of the important kernel statistics as well as process data. Additionally, it is a very simple tool for collecting very useful performance data. While collectl is neither shipped nor supported by Red Hat at this time, it is a useful and popular utility frequently used by users and third party vendors. Starting with IBM Java 7, additional limitations may require the use of -Xlp. Flame graphs are a great way to visualize CPU activity. The following example captures all On-CPU stacks every 50ms for 10 seconds and writes the data to a file called perf.data.

    It's generally a good idea to subtract 1 from F (for example, use 99 rather than 100 Hz) to avoid sampling in lockstep with periodic activity. There is no way to change the output file name to something other than perf.data. If the file perf.data already exists, perf moves it to perf.data.old. This means that the only way to get millisecond-precision wallclock time of a perf stack is to create a separate file that notes the wallclock time with millisecond accuracy right before starting perf. Before recording, ensure that you have installed at least the kernel and glibc symbols (these are only used by the diagnostic tools to map symbols, so they do not change the function of the OS, but they do use about 1GB of disk space). The perf script command might give various errors and warnings, usually about missing symbols and mapping files, which is generally expected since it samples all processes on the box.

    Therefore, one can approximate the wallclock time of each stack by taking the difference between the first stack's time field and the target stack's time field and adding that number of seconds to the captured time minus the sleep time. Unfortunately, this only gives second level resolution because the captured time only provides second level resolution.

    SystemTap simplifies creating and running kernel modules based on kprobes. For most interesting SystemTap scripts, the kernel development package and kernel symbols must be installed. There are two example scripts that may be used; by default, they run for a few minutes, thus they should be run during the issue. Both scripts collect similar data, and they are run with the set of process IDs of the JVMs as parameters. Both scripts also request thread dumps through kill -3. In general, we recommend you use perfMustGather.

    This must be done manually for linperf. To query memory information, use /proc/meminfo. On newer versions of Linux, use the "MemAvailable" statistic to determine the approximate amount of RAM that's available for use by programs. Many tools instead add up "free" and "cached", which was fine ten years ago, but is pretty much guaranteed to be wrong today. It is wrong because Cached includes memory that is not freeable as page cache, for example shared memory segments, tmpfs, and ramfs, and it does not include reclaimable slab memory, which can take up a large fraction of system memory on mostly idle systems with lots of files.

    However, this may change in the future, and user space really should not be expected to know kernel internals to come up with an estimate for the amount of free memory. If things change in the future, we only have to change it in one place. On older kernels, the amount of physical memory available for additional application use is approximately the free Mem value added to the values for buffers and cached.

    In the case above, this value is the sum of free, buffers, and cached. Resident memory pages may be shared across processes. By default, Linux aggressively caches content such as parts of files in memory. Most or all of this physical memory usage will be pushed out of memory if program demands require it; therefore, in general, to understand physical memory usage, subtract "cached" and "buffers" from used memory.
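The traditional free + buffers + cached calculation can be sketched with awk. The /proc/meminfo excerpt below uses made-up illustrative values so the example is self-contained:

```shell
# Illustrative /proc/meminfo excerpt (values in kB are made up):
cat > /tmp/meminfo.sample.txt <<'EOF'
MemTotal:        8167848 kB
MemFree:          422124 kB
Buffers:          198392 kB
Cached:          3251672 kB
EOF

# Traditional approximation of available memory: free + buffers + cached.
awk '/^(MemFree|Buffers|Cached):/ { sum += $2 }
     END { printf "approx available: %d kB\n", sum }' /tmp/meminfo.sample.txt
# prints: approx available: 3872188 kB

# On newer kernels, prefer the MemAvailable field of the real file:
#   awk '/^MemAvailable:/ { print $2 " kB" }' /proc/meminfo
```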

    There is a way to flush the file cache from physical memory. Although this is generally not required, it may be useful before running an iteration of a stress test to ensure maximum comparability with previous runs. This is a non-destructive operation and will not free any dirty objects; running sync first will minimize the number of dirty objects on the system and create more candidates to be dropped. This file is not a means to control the growth of the various kernel caches (inodes, dentries, pagecache, etc.). These objects are automatically reclaimed by the kernel when memory is needed elsewhere on the system.
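A sketch of the drop_caches mechanism described above (root required; value 3 drops page cache plus dentries and inodes):

```shell
# Write back dirty pages first so more cache becomes droppable:
sync
# Drop clean page cache, dentries, and inodes:
echo 3 | sudo tee /proc/sys/vm/drop_caches
```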

    Use of this file can cause performance problems; because of this, use outside of a testing or debugging environment is not recommended. The vmstat command provides information about the distribution and utilization of memory. When physical memory is full, paging (also known as swapping) occurs to provide additional memory. Paging consists of writing the contents of physical memory to disk, making the physical memory available for use by applications.

    The least recently used information is moved first. Paging is expensive in terms of performance because, when required information is stored on disk, it must be loaded back into physical memory, which is a slow process. Where paging occurs, Java applications are impacted because of garbage collection. Garbage collection requires every part of the Java heap to be read.

    If any of the Java heap has been paged out, it must be paged back in when garbage collection runs, slowing down the garbage collection process. The vmstat output shows whether paging was taking place when the problem occurred. The columns of interest are "si" and "so"; nonzero values indicate that paging is taking place. It may be necessary to tune the kernel's shared memory configuration for products such as databases; for example, set kernel.shmmax, kernel.shmall, and kernel.shmmni as recommended by the product documentation. By default, the malloc implementation in glibc (which was based on ptmalloc, which in turn was based on dlmalloc) will allocate into either the native heap (sbrk) or mmap space, based on various heuristics and thresholds: if there's enough free space in the native heap, allocate there.

    In the raw call of sbrk versus mmap, mmap is slower because it must zero the range of bytes. Starting with glibc 2.10, malloc scales better on multi-threaded workloads. This is achieved by assigning threads their own memory pools (arenas) and by avoiding locking in some situations. The default maximum arena size is 1MB on 32-bit and 64MB on 64-bit. The default maximum number of arenas is the number of cores multiplied by 2 for 32-bit and 8 for 64-bit. In principle, the net performance impact of per-thread arenas should be positive, but testing different arena numbers and sizes may result in performance improvements depending on your workload.

    By default, Linux follows an optimistic memory allocation strategy. This means that when malloc returns non-NULL, there is no guarantee that the memory really is available. In case it turns out that the system is out of memory, one or more processes will be killed by the OOM killer. When the OOM killer is invoked, a message is written to the system log.

    Recent versions include a list of all tasks and their memory usage. Creating a dump on a panic requires configuring kdump. The kernel decides which process to kill based on various heuristics and per-process configuration. The OOM killer may be disabled by disabling memory overcommit; for example, set vm.overcommit_memory=2. In this case, malloc will return NULL when there is no memory available. Many workloads can't support such configurations because of high virtual memory allocations.

    While there is considerable philosophical debate about swap, consider disabling swap or setting vm.swappiness to 0. For more information on how to produce and analyze kernel vmcores, see the kernel crash dump documentation. Higher swappiness values increase aggressiveness of swapping; lower values decrease the amount of swap. The default value is 60. The reclaim code works, in a very simplified way, by calculating a few numbers. With those numbers in hand, the kernel calculates its "swap tendency". Once it goes above 100, however, pages which are part of some process's address space will also be considered for reclaim.

    Users who would like to never see application memory swapped out can set swappiness to zero; that setting will cause the kernel to ignore process memory until the distress value gets quite high. A value of 0 tells the kernel to avoid paging program pages to disk as much as possible. A value of 100 encourages the kernel to page program pages to disk even if filecache pages could be removed to make space.

    Note that this value is not a percentage of physical memory; as the above example notes, it is a variable in a function. This may be adversely affecting you if you see page-outs while the filecache is non-zero. For example, in vmstat, if the "so" column is non-zero (you are paging out) and the "cache" column is a large proportion of physical memory, then the kernel is avoiding pushing those filecache pages out as much as it can and is instead paging program pages.

    In this case, either reduce the swappiness or increase the physical memory. This assumes the physical memory demands are expected and there is no leak. Unless tracking file and directory access times is required, use the noatime and nodiratime flags (or consider relatime) when mounting filesystems to remove unnecessary disk activity. For netstat, the -o parameter adds the Timer column, which shows various timers.
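The swappiness and mount-flag tuning above can be sketched as follows (the target value 10 and the fstab line are illustrative; the changes require root):

```shell
# Inspect the current swappiness (default is typically 60):
cat /proc/sys/vm/swappiness

# Temporarily prefer dropping filecache over paging out program pages:
sudo sysctl -w vm.swappiness=10
# Persist by adding "vm.swappiness = 10" to /etc/sysctl.conf.

# Illustrative /etc/fstab line mounting without access-time updates:
#   /dev/sda2  /opt/was  ext4  defaults,noatime,nodiratime  0 2
```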

    For example, the first number before the slash for timewait indicates how many seconds until the socket will be cleared. Ping a remote host. In general, and particularly for LANs, ping times should be less than a few hundred milliseconds with little standard deviation. Recv-Q (Established): the count of bytes not copied by the user program connected to this socket.

    Send-Q (Established): the count of bytes not acknowledged by the remote host. When running lsof, if you are only interested in network activity, some of the flags (such as -i) imply not showing regular files. NFS may be monitored with tools such as nfsiostat. Use ethtool, for example, to query the ring buffers (ethtool -g).

    The default receive buffer size for all network protocols is net.core.rmem_default. The default or requested receive buffer size is limited by net.core.rmem_max. Recent kernels support TCP receive buffer auto-tuning: if auto-tuning is enabled, the kernel will start the buffer at the default and modulate the size between the first (min) and third (max) values of net.ipv4.tcp_rmem. In general, the min should be set quite low to handle the case of physical memory pressure and a large number of sockets. The default send buffer size for all network protocols is net.core.wmem_default.

    The default or requested send buffer size is limited by net.core.wmem_max. By default, Linux sets net.ipv4.tcp_mem to proportions of RAM on boot. Query the value with sysctl and multiply the middle number by the page size (often 4096); this is the number of bytes at which point the OS may start to trim TCP buffers. Tuning was also done for SPECj; for example, tc with netem can emulate a delay on all packets. The TCP specifications require that a TCP sender implements a congestion window (cwnd) to regulate how many packets are outstanding at any point in time. The congestion window is in addition to any constraints advertised by the receiver window size.
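The pages-to-bytes arithmetic for net.ipv4.tcp_mem can be sketched directly; the three values below are illustrative, not from the original:

```shell
# net.ipv4.tcp_mem is expressed in pages: "low pressure high".
# Illustrative values as reported by: sysctl net.ipv4.tcp_mem
low=190005; pressure=253342; high=380010

# Multiply the middle (pressure) value by the page size to get the
# byte threshold at which the kernel may start trimming TCP buffers:
page_size=$(getconf PAGESIZE 2>/dev/null || echo 4096)
echo $(( pressure * page_size ))   # bytes (with a 4096-byte page: 1037688832)
```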

    The default congestion algorithm is cubic. A space-delimited list of available congestion algorithms may be printed with sysctl net.ipv4.tcp_available_congestion_control. Additional congestion control algorithms, often shipped but not enabled, may be enabled with modprobe. An example symptom of a congestion control algorithm limiting throughput is when a sender has queued X bytes to the network, the current receive window is greater than X, but fewer than X bytes are sent before waiting for ACKs from the receiver.
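Querying and switching the congestion control algorithm can be sketched as follows (htcp is an illustrative choice; module availability varies by distribution, and the changes require root):

```shell
# Current and available congestion control algorithms:
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control

# Load an additional algorithm shipped but not enabled, then select it:
sudo modprobe tcp_htcp
sudo sysctl -w net.ipv4.tcp_congestion_control=htcp
```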

    In one case, changing to hybla, which is designed for high latency connections, improved performance. In another case, on a low latency network, changing to hybla decreased performance. Another commonly used algorithm is htcp. The congestion window is not advertised on the network but instead lives within memory on the sender. The default initial congestion window size (initcwnd) may be changed by querying the default route and using the route change command with initcwnd added. In some benchmarks, changing the values of the TCP buffer sysctls improved network performance. To update the listen backlog, set net.core.somaxconn.

    To increase the maximum incoming packet backlog, set net.core.netdev_max_backlog. Each network adapter has an outbound transmission queue which limits the outbound TCP sending rate. Update the TCP keepalive interval by setting net.ipv4.tcp_keepalive_intvl. Update the TCP keepalive probe count by setting net.ipv4.tcp_keepalive_probes. Capture network packets using tcpdump. Normally, tcpdump is run as root. For example, capture all traffic in rotating files of a fixed size (-C) with up to 10 historical files (-W); -C usually requires -Z. If -W is not specified, the behavior is unclear, with some testing showing strange behavior, so it's best to specify -W. Use Wireshark to analyze (covered in the Major Tools chapter).

    Depending on the version of tcpdump, it may have a default snaplen (-s) which is very small (historically 68 bytes) or 65535. In the case of 65535, remember that most packets are much smaller than this (usually limited by the MTU), so it is effectively unlimited. Filter what traffic is captured using an expression (see the man page); for example, to only capture traffic coming into or going out of a particular port. In addition to using Wireshark, you may also read the capture on any Linux machine using the same tcpdump command.
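A rotating-capture invocation along the lines described above might look like this (interface, file size, port, and the -Z user are illustrative assumptions; tcpdump requires root):

```shell
# Rotating capture: 100 million-byte files (-C), at most 10 kept (-W),
# full packets (-s 0), only port 9080 (an illustrative WAS HTTP port);
# -Z controls the user tcpdump drops privileges to after opening files:
sudo tcpdump -i eth0 -s 0 -C 100 -W 10 -Z root -w was.pcap port 9080

# Read a finished capture later with tcpdump or Wireshark:
#   tcpdump -nn -r was.pcap0 | head
```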

    If you would like to capture only the TCP headers, then the best way is to do a capture of representative traffic, load it in Wireshark, filter to TCP packets, sort by frame length, take the smallest value, and use this value N for -s. Packets that arrive for a capture are stored in a buffer, so that they do not have to be read by the application as soon as they arrive. On some platforms, the buffer's size can be set. A size that's too small could mean that, if too many packets are being captured and the snapshot length doesn't limit the amount of data that's buffered, packets could be dropped if the buffer fills up before the application can read them. A size that's too large could use more non-pageable operating system memory than is necessary to prevent packets from being dropped.

    strace can be helpful in certain situations when there are low level delays, such as writing to disk; ltrace helps when investigating library calls, such as libc malloc calls. The taskset command may be used to assign the CPUs for a program when the program is started. Usually, the Linux kernel handles network devices by using the so-called New API (NAPI), which uses interrupt mitigation techniques in order to reduce the overhead of context switches. On low traffic network devices everything works as expected: the CPU is interrupted whenever a new packet arrives at the network interface.

    This gives a low latency in the processing of arriving packets, but also introduces some overhead, because the CPU has to switch its context to process the interrupt handler. Therefore, if a certain amount of packets per second arrives at a specific network device, the NAPI switches to polling mode for that high traffic device.

    In polling mode the interrupts are disabled and the network stack polls the device in regular intervals. It can be expected that new packets arrive between two polls on a high traffic network interface. Thus, polling for new data is more efficient than having the CPU interrupted and switching its context on every arriving packet. Polling a network device does not provide the lowest packet processing latency, though, but is throughput optimized and runs with a foreseeable and uniform work load. When processes are pinned to specific sets of CPUs, it can help to pin any interrupts that are used exclusively or mostly by those processes to the same set of CPUs.

    The IP address was configured on a specific Ethernet device. The Ethernet device was handled by one or more interrupts or IRQs. The heavy-handed approach is to simply turn off the irqbalance service and keep it from starting on boot up. If you need the irqbalance service to continue to balance the IRQs that you don't pin, then you can configure irqbalance not to change the CPU pinnings for IRQs you pinned.

    It is best to read the rows from right to left. Find the device name in the last column, then look at the beginning of the row to determine the assigned IRQ. The low order bit is CPU 0. Most modern network adapters have settings for coalescing interrupts. In interrupt coalescing, the adapter collects multiple network packets and then delivers the packets to the operating system on a single interrupt. The advantage of interrupt coalescing is that it decreases CPU utilization since the CPU does not have to run the entire interrupt code path for every network packet.

    The disadvantage of interrupt coalescing is that it can delay the delivery of network packets, which can hurt workloads that depend on low network latency. On some network adapters the coalescing settings are command line parameters specified when the kernel module for the network adapter is loaded. On the Chelsio and Intel adapters used in this setup, the coalescing settings are changed with the ethtool utility.

    To see the coalescing settings for an Ethernet device run ethtool with the -c option. Many modern network adapters have adaptive coalescing that analyzes the network frame rate and frame sizes and dynamically sets the coalescing parameters based on the current load. Sometimes the adaptive coalescing doesn't do what is optimal for the current workload and it becomes necessary to manually set the coalescing parameters. Coalescing parameters are set in one of two basic ways.

    One way is to specify a timeout. The adapter holds network frames until a specified timeout and then delivers all the frames it collected. The second way is to specify a number of frames. The adapter holds network frames until it collects the specified number of frames and then delivers all the frames it collected. A combination of the two is usually used. To set the coalescing settings for an Ethernet device, use the -C option for ethtool and specify the settings you want to change and their new values.

    This workload benefited from setting the receive timeout on the WebSphere server to the maximum allowed by the Chelsio driver and disabling the frame count threshold. On the database server, increasing the receive timeout was sufficient to gain some efficiency. The setup used only IPv4 connections; it did not need IPv6 support. It got a small boost to performance by disabling IPv6 support in Linux. IPv6 support can be disabled in the Linux kernel by adding ipv6.disable=1 to the kernel command line in the boot loader configuration. Disabling IPv6 support in the Linux kernel guarantees that no IPv6 code will ever be run as long as the system is booted.

    That may be too heavy-handed. A lighter touch is to let the kernel boot with IPv6 support and then use the sysctl facility to dynamically set a kernel variable (net.ipv6.conf.all.disable_ipv6) to disable IPv6. That disables IPv6 on all interfaces; you can optionally disable IPv6 support on specific interfaces instead. The default page size is 4KB. Large pages on Linux are called huge pages, and they are commonly 2MB or 1GB depending on the processor. In general, large pages perform better for most non-memory-constrained workloads because of fewer and faster CPU translation lookaside buffer (TLB) misses. There are two types of huge pages: explicitly reserved huge pages (via hugetlbfs) and transparent huge pages. In general, transparent huge pages are preferred.

    Note that there are some potential negatives to huge pages. In recent kernel versions, transparent huge page (THP) support is enabled by default and automatically tries to use huge pages. The status of THP can be checked under /sys/kernel/mm/transparent_hugepage. Transparent huge pages use the khugepaged daemon to periodically defragment memory to make it available for future THP allocations.
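Checking and adjusting THP can be sketched as follows (sysfs paths are standard for recent kernels; changes require root):

```shell
# The bracketed value is the active THP mode, e.g. "[always] madvise never":
cat /sys/kernel/mm/transparent_hugepage/enabled

# If khugepaged defragmentation causes high CPU usage, disable defrag
# only, keeping THP itself enabled:
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
```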

    If this causes problems with high CPU usage, defrag may be disabled, at the cost of potentially lower usage of huge pages. It's also possible to limit defrag efforts to generating hugepages, when they're not immediately free, only for madvise regions, or to never try to defrag memory and simply fall back to regular pages unless hugepages are immediately available. Clearly, if we spend CPU time to defrag memory, we would expect to gain even more by using hugepages later instead of regular pages.

    An application may mmap a large region but only touch 1 byte of it; in that case, a 2M page might be allocated instead of a 4k page for no good reason. The older method to use huge pages involves libhugetlbfs and complex administration: pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes, and huge pages cannot be swapped out under memory pressure. In this example, there are no hugetlb pages in use, although 1GB is reserved by some processes. The default value for the pid_max file, 32768, results in the same range of PIDs as on earlier kernels.

    Modify the nproc ulimit (ulimit -u) as needed. Review all users' crontabs and the processing that they do; some built-in crontab processing, such as monitoring and file search, may have significant performance impacts. The algorithms used in the CFS provide efficient scheduling for a wide variety of systems and workloads.

    When the scheduler preempts the current thread, it can ruin some of the cache warmth that the thread has created. The CFS has a list of scheduling features that can be enabled or disabled; the settings of these features are available through the debugfs file system, which is available in the Red Hat Enterprise Linux 6 releases. All of the tasks on the runqueue are guaranteed to be scheduled once within the scheduling period. So, the greatest amount of time a task can be given to run is inversely correlated with the number of tasks; fewer tasks means they each get to run longer.

    Here is some sample bash code for disabling wakeup preemption. The documentation is a little fuzzy on how this parameter actually works: it controls the ability of tasks being woken to preempt the current task, and the smaller the value, the easier it is for the woken task to force the preemption. The CFS algorithm is different from the scheduling algorithms for previous Linux releases, and it might change the performance properties of some applications. If you encounter this problem, you might observe high CPU usage by your Java application and slow progress through synchronized blocks. The application might appear to stop because of the slow progress.
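A sketch of such bash code via the debugfs sched_features interface (the feature name differs across kernel versions, NO_WAKEUP_PREEMPT on older kernels versus NO_WAKEUP_PREEMPTION on newer ones, so treat this as a best-effort approximation; debugfs access requires root):

```shell
# Ensure debugfs is mounted:
mount | grep -q debugfs || sudo mount -t debugfs none /sys/kernel/debug

# Show the currently enabled scheduler features:
cat /sys/kernel/debug/sched_features

# Disable wakeup preemption (use NO_WAKEUP_PREEMPT on older kernels):
echo NO_WAKEUP_PREEMPTION | sudo tee /sys/kernel/debug/sched_features
```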

    There are two possible workarounds: invoke the JVM with the additional argument -Xthr:minimizeUserCPU, or use a scheduler tuning that reverts to behavior closer to earlier versions and monitor application performance. The tuned facility is the primary vehicle in which research conducted by Red Hat's Performance Engineering Group is provided to customers. However, depending on the BIOS configuration, settings applied by tuned may be overridden or not applied. Consider testing with Red Hat's "performance" or "network-latency" system profiles.

    The ABRT uses a kernel core pattern pipe hook to capture crash core dumps (https:); ensure there is sufficient space. The default page size on Linux on Power is 64KB (https:). Some workloads benefit from lower SMT hardware thread values (https:). When running a profile on Linux on Power, consider testing with -Xnodfpbd because "The hardware instructions can be slow." Consider disabling hardware prefetching because Java does it in software.

    Idle Power Saver, which is enabled by default, will put the processor into a power saving mode when it detects that utilization has gone below a certain threshold for a specified amount of time. Switching the processor into and out of power saving mode takes time, so for sustained peak performance it is best not to let the system drop into power saving mode.

    Set the Idle Power Saver value to Disabled, then click on the "Save settings" button on the bottom of the page. The Adaptive Frequency Boost feature allows the system to increase the clock speed for the processors beyond their nominal speed as long as environmental conditions allow it, for example, the processor temperature is not too high. Adaptive Frequency Boost is enabled by default.

    Change the setting to Enabled, then click on the "Save settings" button. The PowerLinux systems have a feature called Dynamic Power Saver that will dynamically adjust the processor frequencies to save energy based on the current processor utilization. On each system the two network cards were installed in the two bit DMA slots. One question for tuning a multi-threaded workload for increased capacity is whether to scale up by adding more processor cores to an instance of an application or to scale out by increasing the number of application instances, keeping the number of processor cores per application instance the same.

    By default, a WAS instance is configured with multi-home enabled, which means it listens for requests on its port on all of the IP addresses on the system. If multiple WAS instances are running, they cannot all be allowed to listen for requests on all the IP addresses. They would end up stepping on each other and would not function correctly. For instructions on how to configure an application server to use a single network interface, see Configuring an application server to use a single network interface [4] in the WebSphere Application Server Version 8.

    This can easily be done if the number of Ethernet devices on the system is greater than or equal to the number of WAS instances: the IP addresses for the WAS instances can each be put on their own Ethernet device. If the system has fewer Ethernet devices than WAS instances, then aliases can be used to create multiple virtual devices on a single physical Ethernet device. See section 9. See the value of WAS on zLinux here: The first two are RAM and the last is disk. Discontiguous Saved Segments (DCSS) may be mounted in zLinux to share data across guests, thus potentially reducing physical memory usage; DCSS can also be used as an in-memory filesystem.
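    On Linux, creating alias addresses on one physical device as described above might be sketched as follows (the addresses and device name are examples; requires root and the iproute2 tools):

```shell
# Sketch: add two alias addresses to eth0 so that two WAS instances can each
# bind to their own address.
if ip link show eth0 >/dev/null 2>&1; then
  ip addr add 192.0.2.11/24 dev eth0 label eth0:1
  ip addr add 192.0.2.12/24 dev eth0 label eth0:2
  ip -brief addr show dev eth0   # verify that both addresses are present
  outcome=configured
else
  outcome=skipped                # no eth0 device on this system
fi
echo "$outcome"
```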

    AIX product documentation: apply the recommended AIX-level tuning for Java applications (http:). Note that if IBM Java uses shmget to allocate the Java heap, it will immediately mark the shared memory region for deletion so that, if the JVM crashes, the memory will be released. It is important to use the most optimal SMT setting for the machine, based on the number of CPU-intensive processes running on the machine and their threading characteristics. If the machine is running one or a few single-threaded applications, then disabling SMT may be the most optimal setting.

    On the other hand, if the machine is running a large, multi-threaded application or several CPU-intensive processes, running in SMT4 mode may be the most optimal setting. The benefit of Micro-Partitioning is that it allows increased overall utilization of system resources by applying only the required amount of processor resource needed by each partition. However, due to the overhead associated with maintaining online virtual processors, consider the capacity requirements when choosing values for these attributes.

    For optimal performance, ensure that you create the minimum number of partitions, which decreases the overhead of scheduling virtual processors. CPU-intensive applications, like high-performance computing applications, might not be suitable for a Micro-Partitioning environment. If an application uses most of its entitled processing capacity during execution, use a dedicated processor partition to handle the demands of the application.

    For PowerVM, a dedicated partition is preferred over a shared partition or a workload partition for the system under test. By default, CPU folding occurs in both capped and uncapped modes, with the purpose of increasing CPU cache hits (http:). In general, CPU folding should not be disabled, but low values of CPU folding (as seen in nmon, for example) may indicate low entitlement. Processor folding may be disabled with: Use mpstat to review processor affinity. Query processor usage (http:). The curt tool turns kernel trace data into exact CPU utilization for a period of time: generate curt input data, then generate curt output. nmon used to be a standalone utility, but it has since been integrated into the operating system; it is generally recommended to always run nmon, or something similar, in the background.
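    Running nmon in the background as suggested might be sketched as follows (the flags are -f to write to a file, -s for the interval in seconds, and -c for the sample count; the values are examples):

```shell
# Sketch: collect nmon data every 60 seconds for 24 hours (1440 samples).
if command -v nmon >/dev/null 2>&1; then
  nohup nmon -f -s 60 -c 1440 &
  outcome=started
else
  outcome=unavailable   # nmon is not installed on this system
fi
echo "$outcome"
```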

    Note that any errors starting nmon (such as inadequate file permissions when trying to write to the specified directory) will go to nohup.out. We recommend using the Java-based nmon visualizer, which can be found here: The older spreadsheet-based visualizer can be found here: The tprof command, an AIX operating system provided utility, uses sampling at the processor level to determine which process, thread, and routine are using CPU time. The output from the tprof -skex sleep 60 command produces a file called sleep.prof. Kernel is the subset of the total samples that were in kernel routines; User is the subset of the total samples that were in user routines.

    Shared is the subset of the total samples that were in shared library routines; for a Java application, this value represents time spent inside the Java runtime itself or time spent running JNI code. Other, for a Java application, represents time spent running the Java methods. The value in the Total column for the Java executable, compared to the overall Total, shows the percentage of overall CPU being used by the Java processes. The values of Kernel, Shared, and Other for the Java executable show how much time was spent in the kernel, how much was spent running Java runtime support routines, and how much was spent running the Java methods themselves.

    AIX ships with a JVMTI agent (libjpa) that allows tprof to see Java method names; however, if you have isolated the processor usage in tprof to user Java code, then it is generally better to use a Java profiler such as Health Center. To use the agent, pass -agentlib: This information implies that the performance of the application is being limited by points of contention or delay in the Java process, which prevent it from scaling to use all of the available CPU.

    If a deadlock were present, the CPU usage by the Java process would be low or zero. For threads of interest, note the TID values; you can convert these values to hexadecimal and look up the threads in a javacore. The number of seconds passed in the above example is not the duration of the entire script, but the maximum for parts of it, e.g.

    For the minimum duration of 60 seconds, the total duration will be about 10 minutes. To make this more efficient, run the following dynamic command: Paging is expensive in terms of performance because, when the required information is stored on disk, it must be loaded back into physical memory, which is a slow process. The columns of interest are pi and po (page in and page out) for AIX. Unless otherwise noted, svmon numbers, such as inuse and virtual, are in numbers of frames, which are always 4KB each, even if differently sized pages are involved.
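    Watching the paging columns with vmstat might look like the following sketch (on AIX the columns are pi and po; Linux vmstat labels them si and so):

```shell
# Sketch: sample memory and paging statistics every 5 seconds, 3 times.
if command -v vmstat >/dev/null 2>&1; then
  vmstat 5 3
  outcome=sampled
else
  outcome=unavailable   # vmstat is not installed on this system
fi
echo "$outcome"
```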

    Memory inuse on the first row is the physical memory being used. This is split on the second row between work (for processes), pers (for file cache), and clnt (for NFS file cache). If the memory inuse value is equal to the memory size value, then all the physical memory is being used. Whilst file caching should be released before paging out application data, depending on system demand the application memory pages may be swapped out. The maximum usage of physical memory by file caching can be configured using the AIX vmtune command along with the minperm and maxperm values.

    If all the physical memory is being used, and all or the majority of the in-use memory shown on the fourth row is for work pages, then the amount of physical memory should be increased. A segment is always 256MB. Dynamic page promotion occurs when a set of contiguous smaller pages is promoted to a larger page size; this is done by psmd (the Page Size Management Daemon). Larger page sizes may reduce page faults and are more efficient for addressing, but may increase overall process size due to memory holes.

    If two processes are referencing the same VSID, then they are sharing the same memory. A typical virtual address, e.g. Segment 0x0 is always reserved for the kernel, and segment 0x1 is always reserved for the executable code (java). If you need more native memory, i.e. The cost is that shared libraries are loaded privately, which increases system-wide virtual memory load and thus potentially physical memory requirements. The change should not significantly affect performance, assuming you have enough additional physical memory.

    Consider reducing the inode cache if you observe memory pressure. Determine if the values are too high by comparing the number of client segments in the 'svmon -S' output with the number of unused segments; also consider the absolute number of client segments. As files are opened, we expect these numbers to go up. In most cases, reduce them to each. Tips on network monitoring: tune various kernel parameters based on the type and MTU size of the adapter. If dedicated network adapters are set up for inter-LPAR network traffic, recent versions of AIX support super jumbo frames (up to bytes).

    The netstat command can be used to query kernel network buffers (http:). If using Ethernet interfaces, check Packets Dropped. If "Max Allocated" for a column is greater than "Min Buffers" for that column, this may have caused reduced performance; increase the buffer minimum using, for example: If "Max Allocated" for a column is equal to "Max Buffers" for that column, this may have caused dropped packets; increase the buffer maximum using, for example: It is necessary to bring down the network interface(s) and network device(s) changed by the above commands and then restart those devices and interfaces.
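    The buffer columns described above can be inspected with netstat -v, as in the following sketch (ent0 is an example AIX adapter name):

```shell
# Sketch: show only the buffer statistics for an AIX Ethernet adapter.
if command -v netstat >/dev/null 2>&1; then
  netstat -v ent0 2>/dev/null | \
    grep -i -E "Max Allocated|Min Buffers|Max Buffers" || true
  outcome=queried
else
  outcome=unavailable   # netstat is not installed on this system
fi
echo "$outcome"
```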

    Some customers prefer to simply reboot the LPAR after running the command(s). Capture network packets using iptrace (http:). For example, the following limits to MB: For example, the following limits each packet to 96 bytes: The "no" command is used to query or set network-related kernel parameters. To display current values: However, this only lasts until reboot. By default it is off, but security hardening commands such as aixpert may enable it indirectly. If you are experiencing mysterious connection resets at high load (a message is not logged when this function is exercised), this may be working as designed, and you can tune or disable this function using the tcptr and no commands. Ensure that all network interfaces are either explicitly set to maximum speed or are auto-negotiating maximum speed.
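    Querying and setting parameters with the AIX "no" command might be sketched as follows (the tunable and value are examples; -p makes a change persist across reboots):

```shell
# Sketch: display and tune AIX network kernel parameters (root required).
if command -v no >/dev/null 2>&1; then
  no -a | head                      # display current tunable values
  no -p -o tcp_sendspace=262144     # example: set and persist one tunable
  outcome=tuned
else
  outcome=unavailable               # "no" exists only on AIX
fi
echo "$outcome"
```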

    Also ensure that the speed is full duplex. Next, for each interface that will be used, query the running speed. Consider using 16MB large pages (https:). As the root user, run the following commands to reserve 4 GB of large pages: The AIX scheduler generally does a good job coordinating CPU usage amongst threads and processes; however, manually assigning processes to CPUs can provide more stable, predictable behavior.
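    The 4 GB large-page reservation mentioned above might be sketched as follows (an assumption-laden example: 256 regions of 16 MB each; vmo is an AIX command and -r changes require a reboot to take effect):

```shell
# Sketch: reserve 256 x 16MB = 4GB of large pages on AIX (root required).
if command -v vmo >/dev/null 2>&1; then
  vmo -r -o lgpg_regions=256 -o lgpg_size=16777216
  vmo -p -o v_pinshm=1   # allow pinned shared memory, often paired with this
  outcome=reserved
else
  outcome=unavailable    # vmo exists only on AIX
fi
echo "$outcome"
```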

    Binding processes to particular CPUs is especially important on systems with multiple processing modules and non-uniform memory access (see the next section on memory affinity), and also depending on how the various levels of cache are shared between processors. It is best to understand the system topology and partition resources accordingly, especially when multiple CPU-intensive processes must run on the machine. The easiest way to do this is using the execrset command to specify a list of CPUs to bind a command and its children to:
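    For example (a sketch; the CPU list and Java command are hypothetical):

```shell
# Sketch: bind a JVM and its children to logical CPUs 0-3 on AIX.
if command -v execrset >/dev/null 2>&1; then
  execrset -c 0-3 -e java -jar app.jar   # app.jar is a placeholder
  outcome=bound
else
  outcome=unavailable                    # execrset exists only on AIX
fi
echo "$outcome"
```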

    The current SMT mode and logical-to-physical mappings can be queried using the smtctl command. It is important to note that currently the J9 JVM configures itself based on the number of online processors in the system, not the number of processors it is bound to (which can technically change on the fly). Therefore, if you bind the JVM to a subset of CPUs, you should adjust certain thread-related options, such as -Xgcthreads, which by default is set to the number of online processors.

    Memory affinity can be an important consideration when dealing with large systems composed of multiple processors and memory modules. Each processing module can have a system memory chip module (MCM) attached to it, and while any processor can access all memory modules on the system, each processor has faster access to its local memory module. If memory affinity is enabled, the default memory allocation policy is a round-robin scheme that rotates allocation amongst MCMs. The dscrctl command sets the hardware prefetching policy for the system. Hardware prefetching is enabled by default and is most effective when memory access patterns are easily predictable.

    The hardware prefetcher can be configured with various schemes; however, most transaction-oriented Java workloads may not benefit from hardware prefetching, so you may see improved performance by disabling it using dscrctl -n -s 1. Starting with Java 6. The multiheap option does have costs, particularly increased virtual and physical memory usage. The primary reason is that each heap's free tree is independent, so fragmentation is more likely; there is also some additional metadata overhead. Increasing the number of malloc heaps does not significantly increase the virtual memory usage directly (there are some slight increases because each heap has some bookkeeping that it has to do).

    However, each heap's free tree is independent of the others, while the heap areas all share the same data segment, so native memory fragmentation becomes more likely, and thus indirectly virtual and physical memory usage may increase. It is impossible to predict by how much because it depends on the rate of allocations and frees, the sizes of allocations, the number of threads, etc. It is best to take the known physical and virtual memory usage of the process before the change (rss and vsz at peak workload); let's call this X GB (for example, 9 GB).

    Then apply the change, run the process to peak workload, and monitor. As long as there is that much additional physical memory available, then things should be okay. It has been observed in some cases that such problems are related to the default, single-threaded malloc heap. Using the pool front end and multiheap malloc in combination is a good alternative for multi-threaded applications: small memory block allocations, typically the most common, are handled with high efficiency by the pool front end.

    Larger allocations are handled with good scalability by the multiheap malloc. A simple example of specifying the pool and multiheap combination is the following environment variable setting: This suboption is similar to the built-in bucket allocator of the Watson allocator; however, with this option, you can have fine-grained control over the number of buckets, the number of blocks per bucket, and the size of each bucket. This option also provides a way to view the usage statistics of each bucket, which can be used to refine the bucket settings.
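    The pool-plus-multiheap environment variable setting mentioned above might look like the following sketch (the heap count of 4 is an example value; the JVM must be launched from the shell that exports the variable so it inherits the setting):

```shell
# Sketch: enable the pool front end plus 4 malloc heaps for child processes
# (assumption: the AIX default malloc, which honors MALLOCOPTIONS).
export MALLOCOPTIONS=pool,multiheap:4
echo "$MALLOCOPTIONS"
# Then start the JVM from this shell so it inherits the setting, e.g.:
#   java -jar app.jar    (hypothetical command)
```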

    If the application has many requests of the same size, then the bucket allocator can be configured to preallocate the required size by correctly specifying the bucket options. The block size can go beyond bytes, compared to the Watson allocator or malloc pool options. For a 32-bit single-threaded application, use the default allocator. For a 64-bit application, use the Watson allocator.

    For multi-threaded applications, use the multiheap option, and set the number of heaps proportional to the number of threads in the application. For single-threaded or multi-threaded applications that make frequent allocations and deallocations of small memory blocks, use the malloc pool option.

    For an application whose memory usage pattern shows high usage of memory blocks of the same size (or sizes that can fall to a common block size in the bucket option), and sizes greater than bytes, configure the malloc bucket option.

    For older applications that require high performance and do not have memory fragmentation issues, use malloc 3.1. Ideally, the Watson allocator, along with the multiheap and malloc pool options, is good for most multi-threaded applications: the pool front end is fast and scalable for small allocations, while multiheap ensures scalability for larger and less frequent allocations. If you notice high memory usage in the application process even after memory is freed, the disclaim option can help. Network jumbo frame support can increase throughput between the application server and the database.

    A commonly used Linux client program is x3270. On some client keyboard layouts, the right "Ctrl" key is used as the "Enter" key. After logging in through a session, it is common to access most programs by typing ISPF. Typically, if available, F7 is page up, F8 is page down, F10 is page left, and F11 is page right. Typing "m" followed by F7 or F8 pages to the top or bottom, respectively. LOG shows the system log, and it is the most common place to execute system commands. Then use F8, or press Enter, to refresh the screen and see the command's output.

    DA shows active address spaces (http:). In the NP column, type "S" next to an address space to get all of its output, or type "?" to view its individual joblog members. When viewing the joblog members of an address space (?DA), type XDC next to a member to transfer it to a data set. Display threads in an address space and the accumulated CPU by thread: you can search for the PID in the joblogs of the address space. This display shows all the threads in the address space, but remember that threads that are WLM-managed, e.g.

    WebSphere trace entries also contain the TCB address of the thread generating those entries. Think of your infrastructure as a plumbing system: optimal drain performance only occurs when no pipes are clogged. Think of the thread pool as a queuing mechanism to throttle how many active requests you will have running at any one time in your application. You more than likely need to cut back on the number of threads active in the system to ensure good performance for all applications, due to context switching at the OS layer for each thread in the system. Sizing or restricting the maximum number of threads an application can have can sometimes be used to prevent rogue applications from impacting others.

    Default sizes for WAS thread pools on v6. Monitor PMI metrics via TPV or other tools to watch for threads waiting on connections to the database, as well as their wait time. Keeping cell size smaller leads to more efficient resource utilization due to less network traffic for configuration changes, DRS, HAManager, etc. For example, core groups should be limited to no more than 40 to 50 instances. Smaller cells and logical grouping also make migration forward to newer versions of products easier and more compartmentalized.

    Look at the load on the database system, network, etc., and extrapolate whether it will support the full system's load; if not, or if there are questions, test. Performance testing needs to be representative of the patterns that your application will actually be executing. Proper performance testing keeps track of and records key system-level metrics, as well as throughput metrics, for reference later when changes to hardware or the application are needed.