Archive

Archive for the 'Troubleshooting' category

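Yet Another Article on Java Class Loaders

March 3, 2010 hashei 2 comments

I have previously written two articles on Java class loading: "WebSphere's Class Loading Mechanism and Troubleshooting" and "Revisiting WebSphere Class Loading and Troubleshooting". Today I came across "An In-Depth Look at Java Class Loaders" (《深入探讨 Java 类加载器》) on the IBM website and am sharing it here, warmed-over as the topic may be. It should give me some direction the next time a problem comes up.

The Java virtual machine's default behavior is sufficient for most situations. But when you have to interact with class loaders without understanding how they work, it is easy to spend a great deal of time debugging exceptions such as ClassNotFoundException and NoClassDefFoundError. This article describes Java class loaders in detail to help readers gain a thorough understanding of this important concept in the Java language.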

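Simple WebSphere Troubleshooting

September 25, 2009 hashei No comments

At work I often run into situations that are sometimes traceable and sometimes downright "spooky": WebSphere will not start after being stopped, an application deployed on a cluster cannot be reached through IHS, an application update is rolled back after the server restarts... You can of course call a dedicated middleware administrator for any of these, but by the time the administrator arrives on site half a day may have passed, and the problem may be simple enough to fix in a few minutes. With some basic troubleshooting and diagnostic techniques, you can resolve these small problems yourself.

Below are a few common, simple errors; I hope they help people working on site: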

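Application cannot be accessed

Below is a common topology in which IBM HTTP Server (IHS) forwards requests to a back-end AppCluster:

[Figure: ND topology]

When an application cannot be accessed, the problem may lie in the HTTP Server, in the App Server, or, more likely still, in the database, so the first step is to narrow the scope and pinpoint where the problem occurs.

Here I assume the application address on IHS is http://192.168.1.51/yingyong

The DMGR address is http://192.168.1.51:9060/admin

The application addresses on the APP SERVERs are http://192.168.2.50:9080/yingyong and http://192.168.2.51:9080/yingyong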

 

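1. Server not found or 404 error

Visit http://192.168.1.51 to check whether IHS is working. If the page cannot be displayed, go to "Services" and try restarting "IBM HTTP SERVER V6.x". If the service fails to start, "Services" only tells you that the service could not start, or that it started and then stopped because of a fatal error. So go to the IBM\HTTPServer\bin directory and run apache -k start or httpd -k start; on failure it prints detailed information. The usual causes are a port already in use or a syntax error in httpd.conf under the conf directory (it reports the offending line number).

If IHS itself responds, try accessing http://192.168.2.50:9080/yingyong and http://192.168.2.51:9080/yingyong directly. If both work, IHS forwarding has failed.

[Figure: IHS forwarding]

In the administrative console at http://192.168.1.51:9060/admin, go to "Servers" > "Web servers", select the relevant webserver, then click "Generate Plug-in" followed by "Propagate Plug-in".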
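The narrowing-down logic above can be sketched in Python. This is only an illustration: the function names `responds` and `triage` are made up for this example, and the URLs you would probe are the assumed addresses from this article.

```python
import urllib.error
import urllib.request


def responds(url, timeout=5):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and HTTP 4xx/5xx responses.
        return False


def triage(ihs_ok, app_servers_ok):
    """Map probe results to the next step described in the text.

    ihs_ok: whether the IHS front page responded.
    app_servers_ok: one boolean per app server URL probed directly.
    """
    if not ihs_ok:
        return "IHS is down: restart it, or run 'apache -k start' for details"
    if all(app_servers_ok):
        return "IHS forwarding failed: regenerate and propagate the plug-in"
    return "App server problem: check whether the server is running"
```

With the addresses assumed here, you would call `triage(responds("http://192.168.1.51"), [responds("http://192.168.2.50:9080/yingyong"), responds("http://192.168.2.51:9080/yingyong")])`.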

 

 

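Many IHS forwarding failures occur because the webserver was not selected as a target when the application was deployed, or because propagating the plug-in failed for reasons such as directory access control. Find your application under "Applications", click "Manage Modules", and check that it is mapped to both the app server and the webserver. Note the order: first select the target cluster and servers in the first box, then tick the checkbox in front of the module, and finally be sure to click "Apply" rather than going straight to OK.

[Figure: application deployment]

Forwarding can fail for many reasons, but the quickest fix is to copy the file by hand. After the plug-in is generated, the console shows where the file was written; take it from there and copy it to the location where propagation failed.

I have also seen a very odd case, though: deployment was correct and propagation was correct, yet the application still could not be accessed. In that case, look at the generated plugin-cfg.xml file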

<UriGroup Name="default_host_server1_xzh-hasheiNode01_Cluster_URIs">
      <Uri AffinityCookie="JSESSIONID" AffinityURLIdentifier="jsessionid" Name="/snoop/*"/>
      <Uri AffinityCookie="JSESSIONID" AffinityURLIdentifier="jsessionid" Name="/hello"/>
      <Uri AffinityCookie="JSESSIONID" AffinityURLIdentifier="jsessionid" Name="/hitcount"/>

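Check whether a line for your application's URL is present. If it is not, simply add it by hand, but remember to check and re-apply the change the next time the plug-in is generated.

Finally, make sure the app server is started and has not exited because of an error; this is covered in detail in the next part.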
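Checking plugin-cfg.xml for your application's URI can also be scripted. The sketch below is a minimal illustration, not an IBM tool: the sample XML is a trimmed, made-up stand-in for a real plugin-cfg.xml, and `routed_uris` and `is_routed` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the shape of the UriGroup excerpt shown above (illustrative only).
SAMPLE = """<Config>
  <UriGroup Name="default_host_Cluster_URIs">
    <Uri AffinityCookie="JSESSIONID" AffinityURLIdentifier="jsessionid" Name="/snoop/*"/>
    <Uri AffinityCookie="JSESSIONID" AffinityURLIdentifier="jsessionid" Name="/hello"/>
  </UriGroup>
</Config>"""


def routed_uris(plugin_cfg_xml):
    """Return the Name attribute of every <Uri> element."""
    root = ET.fromstring(plugin_cfg_xml)
    return [uri.get("Name") for uri in root.iter("Uri")]


def is_routed(plugin_cfg_xml, context_root):
    """True if some <Uri> entry equals the context root or lies beneath it."""
    prefix = context_root.rstrip("/")
    return any(
        name == prefix or name.startswith(prefix + "/")
        for name in routed_uris(plugin_cfg_xml)
    )
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring`. With the sample above, `is_routed(SAMPLE, "/yingyong")` is false, which is the situation where the line has to be added by hand.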

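2. 500 Internal Server Error

A 500 internal error has three possible causes. The first is a bug in the program, which is not the focus of this article. The second is that the AppServer or the application did not start properly. The third is that the database connection failed.

To check whether the AppServer is running, access the administrative console or look at the Java processes. The profiles\AppSrv01\logs\server1 directory contains a pid file, and the PID recorded in it is the process ID. On Windows, in "Task Manager" click "View" > "Select Columns" and tick PID (Process Identifier) to display it. On Unix/Linux, run ps -ef | grep PID or ps -ef | grep java to see that app's process or all Java processes. Note: on a node where the DM profile is installed there are usually at least three Java processes (DM, node agent, and app server), so take care to tell them apart.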
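On Unix/Linux the pid-file check can be automated. This is a small sketch under the assumption that the pid file contains just the process ID as text; `read_pid` and `is_running` are names invented for the example.

```python
import os


def read_pid(pid_file):
    """Read the process ID stored in a pid file such as logs/server1/server1.pid."""
    with open(pid_file) as f:
        return int(f.read().strip())


def is_running(pid):
    """Unix only: signal 0 tests for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Do not use this on Windows, where `os.kill` treats the second argument differently.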

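Once you have confirmed that the server is not running, or if you want to restart it, run startServer.sh (or startServer.bat) with the server name under profiles\AppSrv01\bin, for example startServer.sh server1, and watch the startup until the message "Server server1 open for e-business" appears, which means it started successfully. If it fails, open SystemOut.log under logs, read the latest entries, and look for error messages.

Startup failures generally come down to a port conflict or insufficient permissions.

Port conflict

A port error shows up in SystemOut.log as follows:

TCPC0003E: TCP Channel TCP_2 initialization failed. The socket bind failed for host * and port 9081. The port may already be in use.

You can then use netstat -an to list the listening ports, and tools such as tcpview or icesword to find the process occupying the port. On Linux/Unix, netstat -an | grep LISTEN (or grep for the port number) shows the ports directly, and lsof -i :port or rmsock shows the process occupying the port.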
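Before starting the server you can also test whether a port is free by trying to bind it yourself. The helper below is a minimal sketch (the name `port_in_use` is made up); it only tells you that the bind fails, not which process holds the port, so lsof or tcpview is still needed for that.

```python
import socket


def port_in_use(port, host="0.0.0.0"):
    """Return True if binding host:port fails, e.g. because it is already taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
    return False
```

For the error above you would check `port_in_use(9081)`. Note that binding a port below 1024 without root also fails, which this sketch reports as "in use".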

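At this point you may suddenly recall some casual action that took over a WebSphere port. What now? If WebSphere is the one that has to give way, you can edit the serverindex.xml file in the profile_path\config\cells\cell_name\nodes\node_name directory: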

<specialEndpoints xmi:id="NamedEndPoint_1243228596786" endPointName="WC_adminhost">
<endPoint xmi:id="EndPoint_1243228596786" host="*" port="9060"/>
</specialEndpoints>
<specialEndpoints xmi:id="NamedEndPoint_1243228596787" endPointName="WC_defaulthost">
……

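See the port number? Be aware, though, that WC_adminhost, WC_defaulthost, WC_adminhost_secure, and WC_defaulthost_secure (the commonly used administrative port, the application port, and their respective SSL ports) need a follow-up change: after modifying any of them, go to profile_path\config\cells\cell_name and update the corresponding port in virtualhosts.xml (adding a new entry also works). Otherwise, do not blame me for not warning you when a "virtual host not defined" error appears. (I have seen many cases where the application was reachable through IHS but failed when accessed directly by port. The cause was a missing virtual host entry, and adding the changed port under Virtual Hosts > default_host in the administrative console fixed it.)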
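To see which ports are configured and which of them still need a virtual host alias, the endpoint entries can be read with a few lines of Python. This is a sketch on a simplified, made-up sample: a real serverindex.xml has a different root element and may hold several server entries with the same endpoint names, which this version would overwrite. The helper names and the `alias_ports` argument are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified sample (illustrative only; not a complete serverindex.xml).
SAMPLE = """<ServerIndex xmlns:xmi="http://www.omg.org/XMI">
  <specialEndpoints xmi:id="NamedEndPoint_1" endPointName="WC_adminhost">
    <endPoint xmi:id="EndPoint_1" host="*" port="9060"/>
  </specialEndpoints>
  <specialEndpoints xmi:id="NamedEndPoint_2" endPointName="WC_defaulthost">
    <endPoint xmi:id="EndPoint_2" host="*" port="9081"/>
  </specialEndpoints>
  <specialEndpoints xmi:id="NamedEndPoint_3" endPointName="SOAP_CONNECTOR_ADDRESS">
    <endPoint xmi:id="EndPoint_3" host="*" port="8880"/>
  </specialEndpoints>
</ServerIndex>"""

# The four endpoints the text says must also appear in virtualhosts.xml.
WEB_ENDPOINTS = {
    "WC_adminhost", "WC_defaulthost",
    "WC_adminhost_secure", "WC_defaulthost_secure",
}


def endpoint_ports(xml_text):
    """Map each endPointName to its configured port."""
    root = ET.fromstring(xml_text)
    ports = {}
    for ep in root.iter("specialEndpoints"):
        end_point = ep.find("endPoint")
        if end_point is not None:
            ports[ep.get("endPointName")] = int(end_point.get("port"))
    return ports


def ports_needing_alias(ports, alias_ports):
    """Web container ports that have no matching virtual host alias port."""
    return sorted(
        port for name, port in ports.items()
        if name in WEB_ENDPOINTS and port not in alias_ports
    )
```

Here `alias_ports` stands for the set of ports you have already found in virtualhosts.xml; with the sample and aliases for 9060 and 9080 only, port 9081 is reported as missing.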

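Insufficient permissions

Insufficient permissions usually occur on Unix/Linux. A common scenario: a dedicated user and group were created when WebSphere was installed, but lax permission management during development gave the developers root access as well, and they started and stopped the server without su-ing to the was user, leaving root-owned files behind. Once root access is taken back, the service can no longer be started. To fix this, use root to run chown -R username:groupname on the entire installation directory.

Another case is changing a port to one below 1024, which on Unix/Linux means the server can only be started as root.

Other cases

Also keep an eye on the file system; I have seen access.log and dump files fill it up several times.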

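Application update fails

The application was updated: the modified files were uploaded straight into the directory, the application was restarted, and testing looked fine. But wait! Why did it revert to the pre-change version after the app server was restarted, or after the DM was restarted in a cluster?

This is most likely the DM's synchronization mechanism at work. Have you noticed that your application also exists under profiles\AppSrv01\config\cells\cell_name\applications? Open it and you will see that it does not hold the whole application, only important content such as web.xml and WEB-INF. So if a file you updated also exists under the config directory, you need to update that copy as well. In a clustered environment, note that yet another copy is waiting for you under the config directory of profiles\Dmgr.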

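3. Confirm the database is healthy

This one is simple: connect to the database with sqlplus and confirm that the connection works and a query succeeds.

4. Log files matter

Log files are what troubleshooting depends on. I have seen quite a few projects which, being in a trial-run phase with frequent changes, had log4j emitting a huge volume of output, printing every single SQL statement in full. As a result a 1 MB SystemOut.log filled up in ten-odd minutes, and even 10 archived SystemOut.log files covered only a few hours of logs (1 to 2 MB per file with 10 to 20 archives in total is a typical setting). By the time I arrived, the crime scene had been wiped clean. (In such cases a restart usually relieves the problem temporarily, but the root cause is never found.)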
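The arithmetic behind "10 archives only cover a few hours" is worth writing down. The sketch below assumes a file fills in 15 minutes, which is my stand-in for "ten-odd minutes"; both function names are made up.

```python
import math


def retention_hours(minutes_to_fill_one_file, archives):
    """Hours of history kept when each rotated file fills in the given time."""
    return minutes_to_fill_one_file * archives / 60


def archives_needed(minutes_to_fill_one_file, hours_to_keep):
    """Number of rotated files required to keep the requested history."""
    return math.ceil(hours_to_keep * 60 / minutes_to_fill_one_file)
```

Under that assumption, 10 archives hold 2.5 hours of logs, and keeping a full day would take 96 files of the same size. Reducing the log volume is usually the better fix.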

So saving the logs promptly is important: always keep a copy of SystemOut.log and SystemErr.log from logs\server1, and note the time the failure occurred.
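Saving the logs can be reduced to one call so nobody forgets it under pressure. This is a minimal sketch; `snapshot_logs` and the `incident-<timestamp>` directory naming are my own conventions, not anything WebSphere provides.

```python
import shutil
import time
from pathlib import Path


def snapshot_logs(log_dir, dest_dir, names=("SystemOut.log", "SystemErr.log")):
    """Copy the named logs into a timestamped folder and return the copies."""
    stamp = time.strftime("%Y%m%d-%H%M%S")  # records when the snapshot was taken
    target = Path(dest_dir) / f"incident-{stamp}"
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        src = Path(log_dir) / name
        if src.is_file():
            shutil.copy2(src, target / name)  # copy2 keeps file modification times
            copied.append(target / name)
    return copied
```

The folder name carries the snapshot time, which doubles as a note of when the failure was observed.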

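Unlike Weblogic, WebSphere does not keep showing the running log in a console window. On Unix/Linux you can get the same effect with tail -f SystemOut.log. There is also a tail tool for Windows, which you run followed by the file name.

[Figure: tail tool]

Closing remarks

These are the simple troubleshooting steps I can think of for now. Developers run into all of them fairly easily, so they are well worth knowing.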

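Fixing Weblogic 10.3.0 Hangs on AIX 6.1 with JDK 1.6

August 25, 2009 hashei 25 comments

Last week I installed weblogic 10.3.0 on AIX 6.1 and configured an HACMP cluster environment, but over the following days it kept hanging, which cost me a day of overtime.

Symptoms:

After Weblogic starts, it hangs within 10 to 30 minutes, and neither the application nor the administration console can be reached. After a forced kill -9 pid the port is not released, and checking the port with the rmsock command shows "Wait for exiting processes to be cleaned up before removing the socket".

Analysis and handling

1. Use ps -ef | grep java to find the weblogic process, then run kill -3 pid every three minutes to generate javacore files in the domain directory

2. Analyzing the weblogic log turned up the following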

<Aug 21, 2009 4:33:37 AM CDT> <Error> <WebLogicServer> <BEA-000337> <[STUCK] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "620" seconds working on the request

"weblogic.work.SelfTuningWorkManagerImpl$WorkAdapterImpl@20de20de", which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace:

java.net.SocketOutputStream.socketWrite0(Native Method)

java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:103)

……

<Aug 21, 2009 4:34:37 AM CDT> <Error> <WebLogicServer> <BEA-000337> <[STUCK] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "680" seconds working on the request

"weblogic.work.SelfTuningWorkManagerImpl$WorkAdapterImpl@20de20de", which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace:

java.net.SocketOutputStream.socketWrite0(Native Method)

java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:103)

……

3. Analyzing the thread dumps just generated with IBM Thread and Monitor Dump Analyzer for Java turned up the following two threads:

3XMTHREADINFO "[ACTIVE] ExecuteThread: '5' for queue: 'weblogic.kernel.Default (self-tuning)'" TID:0x39CBED00, j9thread_t:0x3751C83C, state:R, prio=5

3XMTHREADINFO1 (native thread ID:0xCE1DB, native priority:0x5, native policy:UNKNOWN)

4XESTACKTRACE at java/net/PlainSocketImpl.socketClose0(Native Method)

4XESTACKTRACE at java/net/PlainSocketImpl.socketPreClose(PlainSocketImpl.java:706)

4XESTACKTRACE at java/net/PlainSocketImpl.close(PlainSocketImpl.java:540)

4XESTACKTRACE at java/net/SocksSocketImpl.close(SocksSocketImpl.java:1041)

4XESTACKTRACE at java/net/Socket.close(Socket.java:1343)

4XESTACKTRACE at weblogic/socket/SocketMuxer.closeSocket(SocketMuxer.java:475)

4XESTACKTRACE at weblogic/socket/SocketMuxer.cancelIo(SocketMuxer.java:813)

4XESTACKTRACE at weblogic/socket/SocketMuxer$TimerListenerImpl.timerExpired(SocketMuxer.java:1021(Compiled Code))

4XESTACKTRACE at weblogic/timers/internal/TimerImpl.run(TimerImpl.java:273(Compiled Code))

4XESTACKTRACE at weblogic/work/SelfTuningWorkManagerImpl$WorkAdapterImpl.run(SelfTuningWorkManagerImpl.java:516(Compiled Code))

4XESTACKTRACE at weblogic/work/ExecuteThread.execute(ExecuteThread.java:201(Compiled Code))

4XESTACKTRACE at weblogic/work/ExecuteThread.run(ExecuteThread.java:173)

3XMTHREADINFO "ExecuteThread: '7' for queue: 'weblogic.socket.Muxer'" TID:0x35381D00, j9thread_t:0x35385864, state:R, prio=5

3XMTHREADINFO1 (native thread ID:0xB916F, native priority:0x5, native policy:UNKNOWN)

4XESTACKTRACE at weblogic/socket/PosixSocketMuxer.poll(Native Method)

4XESTACKTRACE at weblogic/socket/PosixSocketMuxer.processSockets(PosixSocketMuxer.java:102(Compiled Code))

4XESTACKTRACE at weblogic/socket/SocketReaderRequest.run(SocketReaderRequest.java:29)

4XESTACKTRACE at weblogic/socket/SocketReaderRequest.execute(SocketReaderRequest.java:42)

4XESTACKTRACE at weblogic/kernel/ExecuteThread.execute(ExecuteThread.java:145)

4XESTACKTRACE at weblogic/kernel/ExecuteThread.run(ExecuteThread.java:117)

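4. Only these two execute threads are in the running state, one doing close() and the other doing poll(). All the others are blocked or waiting.

5. After searching Metalink and confirming with the 800 support line, this turned out to be a long-standing bug of Weblogic on the AIX JVM, appearing across versions since 8.1.4. The cause is an incompatibility between the low-level socket implementation of IBM's JVM and weblogic, and the fix is to apply patch CR370915_1030GA.jar.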
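Step 4 above (finding which threads are actually running) can be done without a GUI tool when you only need a quick count. The following is a rough sketch that reads the 3XMTHREADINFO lines of a javacore; the third sample line is made up for illustration, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Matches the thread header lines but not the 3XMTHREADINFO1 detail lines.
THREAD_RE = re.compile(
    r'^3XMTHREADINFO\s+"(?P<name>.*)".*?\bstate:(?P<state>\w+)', re.MULTILINE
)

SAMPLE = '''\
3XMTHREADINFO      "[ACTIVE] ExecuteThread: '5' for queue: 'weblogic.kernel.Default (self-tuning)'" TID:0x39CBED00, j9thread_t:0x3751C83C, state:R, prio=5
3XMTHREADINFO1            (native thread ID:0xCE1DB, native priority:0x5, native policy:UNKNOWN)
3XMTHREADINFO      "ExecuteThread: '7' for queue: 'weblogic.socket.Muxer'" TID:0x35381D00, j9thread_t:0x35385864, state:R, prio=5
3XMTHREADINFO      "ExecuteThread: '3' for queue: 'weblogic.socket.Muxer'" TID:0x35381E00, j9thread_t:0x35385A00, state:B, prio=5
'''


def thread_states(javacore_text):
    """Return (thread name, state code) pairs found in the javacore text."""
    return [(m.group("name"), m.group("state")) for m in THREAD_RE.finditer(javacore_text)]


def state_counts(javacore_text):
    """Count threads per state code, e.g. R for runnable, B for blocked."""
    return Counter(state for _, state in thread_states(javacore_text))


def running_threads(javacore_text):
    """Names of the threads whose state is R."""
    return [name for name, state in thread_states(javacore_text) if state == "R"]
```

If only a couple of threads are in state R across several dumps, as happened here, those are the ones whose stacks are worth reading first.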

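Procedure

1. In the weblogic startup script, find the CLASSPATH line

2. Add the patch jar as the first entry of the CLASSPATH variable
E.g.: CLASSPATH="${CLASSPATH}${CLASSPATHSEP}${MEDREC_WEBLOGIC_CLASSPATH}"
-->
CLASSPATH=/path/to/CR370915_1030GA.jar:"${CLASSPATH}${CLASSPATHSEP}${MEDREC_WEBLOGIC_CLASSPATH}"

3. The change above affects only this domain. To apply it to all domains, add the jar at the very front of WEBLOGIC_CLASSPATH= in the commEnv.sh file under the common/bin/ directory

Summary

This bug is fairly common on platforms that combine weblogic with IBM's JVM. If the log messages above appear, you can be almost certain that the CR370915 patch is needed.

Update: the patch I have here is only for weblogic 10.3.0.0; for other versions you can download it yourself with Smart Update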

Patches for WLS 8.x can be found in My Oracle Support. Open the Patches & Updates tab. Search for patch ID 8173442 for the patches for WLS 8.1mp3, 8.1mp4, and 8.1mp5. Search for patch ID 8179792 for the patch for WLS 8.1mp6.

Patches for WLS 9.x and higher can be downloaded from Smart Update using these patch IDs and passcodes:

PATCH REPOSITORY INFORMATION

WLS Version | Patch ID | Passcode
------------+----------+---------
9.2         | T4DV     | 7C7PYV9B
9.2mp1      | HZHQ     | PTUYCCSI
9.2mp2      | WJD2     | GU1CW2AB
9.2mp3      | GNLT     | 8J9L6Q4Y
10.0        | PMAJ     | 9UQ69LLT
10.0mp1     | ITVL     | K8RBHQQ2
10.3        | 9YT5     | I1DB5QSV

If the production machine has no Internet access, you can use either of the following approaches:

1. Using SmartUpdate in offline mode
===========================
You can apply the patch using SmartUpdate with the following steps:
  1. Download the patch using SmartUpdate on another machine with Internet access.
  2. Copy the files (for example E5W8.jar and WGQJ.jar) and patch-catalog.xml from your machine with Internet access to the offline machine. For example, say you have a test environment running on a Windows box. Your production environment is running on UNIX. You might copy the jar files from %BEA_HOME%\utils\bsu\cache-dir to $BEA_HOME/utils/bsu/cache-dir.
  3. When a machine connects to Smart Update, the catalog of patches is always updated automatically. Thus, when a patch is being copied to an offline machine, the patch-catalog.xml file must also be copied over.
  4. Run SmartUpdate in offline mode and apply patches and patch sets. This can be done using the SmartUpdate command-line interface (see http://download.oracle.com/docs/cd/E14759_01/doc.32/e14143/commands.htm#i1074489).
  5. This is the syntax for the command to install a patch:
./bsu.sh -prod_dir=<weblogic_home> -patchlist=<patchID> -verbose -install
For example,
./bsu.sh -prod_dir=/opt/bea/weblogic92 -patchlist=E5W8 -verbose -install
./bsu.sh -prod_dir=/opt/bea/weblogic92 -patchlist=WGQJ -verbose -install
2. Applying the patch to the classpath manually
============================
  1. You can apply the patch to the offline system manually by extracting the actual patch and adding it to the classpath on the offline system. Extract the actual patch jar file. If you downloaded the patch using SmartUpdate, it will be in the form <patch_id>.jar (for example: E5W8.jar). Inside this jar file is the actual patch jar file, which will be of the form CR326566_92mp3.jar. Extract the latter file for the following steps.
  2. Add the extracted jar file as the first element of the classpath of the Admin server as well as the managed servers in the domain.
  3. If you are starting servers using the WebLogic startup script, update the classpath in the startup script like this:
set CLASSPATH=<PATCH_DIR>\jars\CR326566_92mp3.jar;%CLASSPATH% (Windows)
CLASSPATH=<PATCH_DIR>/jars/CR326566_92mp3.jar:$CLASSPATH (UNIX)
where PATCH_DIR is the directory on your local machine where you extracted/saved the patch file.
  4. Similarly, if you are starting servers using Node Manager, add the patch jar to the beginning of the Class Path argument in the Server Start tab for the server(s).

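I usually use the second approach, which is quick and convenient for a single patch. SmartUpdate can be installed on its own, but it asks you which BEA home directory to apply to, and the patches available for download differ by version and platform. A Windows machine naturally has no BEA home for AIX, but you only need to create a directory yourself and copy a registry.xml into it.

Categories: weblogic, troubleshooting

Diagnosing a WebSphere Performance Problem

August 24, 2009 hashei No comments

A user once phoned to say that a certain application's response time shot up as soon as concurrency rose even slightly, and that the administrative console also became so slow that normal work was nearly impossible.

On arriving on site I found the platform was WebSphere 6.0.2.29 on Windows 2003 Enterprise Edition SP2, so I first analyzed SystemOut.log and found the following errors: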

= com.ibm.websphere.ce.j2c.ConnectionWaitTimeoutException Max connections reached 869

Exception = com.ibm.websphere.ce.j2c.ConnectionWaitTimeoutException

Source = Max connections reached

probeid = 869

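I also found "Caused by: java.io.IOException: Async IO operation failed, reason: RC: 10053 An established connection was aborted by the software in your host machine."

Clearly the connection pool could not hand a new connection to the request, and the wait lasted long enough to hit the WaitTimeout. My first reaction was that the database connection pool was too small, so I set it to an initial size of 50 and a maximum of 150, with idle connections released automatically after 120 seconds.

But tuning the parameters brought no improvement, and in less than 10 minutes the application slowed down again. Looking again at the errors in SystemOut.log and FFDC, I found many I/O-related errors: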

com.ibm.ws.webcontainer.channel.WCCByteBufferInputStream 102

Exception = java.net.SocketTimeoutException

Source = com.ibm.ws.webcontainer.channel.WCCByteBufferInputStream

probeid = 102

stack Dump = java.net.SocketTimeoutException: Async operation timed out

java.io.IOException com.ibm.ws.webcontainer.servlet.RequestUtils.parsePostData 398

Exception = java.io.IOException

Source = com.ibm.ws.webcontainer.servlet.RequestUtils.parsePostData

probeid = 398

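Stack Dump = java.io.IOException: Async IO operation failed, reason: RC: 55 The specified network resource or device is no longer available. probeid = 1184

Only afterwards did I learn that the problem actually lay between the database and the middleware. At the time, however, no error said the connection pool was too small, the administrative console was also nearly unusable, and the application uploads and validates many files during operation, so I suspected that a server I/O bottleneck was slowing the application down.

So I opened the Performance tool on the server and added counters such as %Disk Time, %Disk Write, %Disk Read, Disk Queue Length, and Page Fault. %Disk Time averaged between 20 and 70, with momentary values above 90, and Disk Read was very small; the activity was almost entirely Disk Write.

I kept watching for a while and found that the application was still slow after disk activity had dropped, so I analyzed garbage collection to see whether there was a memory leak.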

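Analysis with IBM Monitoring and Diagnostic Tools for Java™ - Garbage Collection and Memory Visualizer showed that the mean interval between collections was only 0.46 minutes, and that GC kicked in when heap usage was only a little over 250 MB even though the heap was configured at 760 to 1260 MB. I concluded that excessive heap fragmentation was causing frequent GC and slowing the service's responses. The tool also reported "The mean heap unusable due to fragmentation is estimated at 34.685%". The application team told me the objects they allocate are generally short-lived, so I changed the GC policy to gencon (generational concurrent). Watching the GC log afterwards, the interval between collections was much longer, the collections were in the young area, and global collections were 23 minutes apart.

Unfortunately, all this tuning did not improve things in the slightest. The real-time performance monitor in the console showed 40 to 60 JDBC connections in the busy state, and at the time I could not tell whether that was normal or whether that many connections were truly needed. Later the application developers examined the database and found many active connections that had been running for 5 to 10 minutes, each executing a query. It turned out the application was built on Hibernate, which wrapped the query in its own functions so that the existing index could not be used (the application team had made an elementary mistake when creating the index). After the index was rebuilt the queries completed quickly, the number of busy pool connections dropped to 0 to 5, and the application returned to normal. So the root cause was that database queries took too long: the threads were all waiting for the database to return, the 100 threads were quickly used up, and no new requests could be served. Because the connection pool had already been enlarged, no errors were reported on that front.

Looking back on this week of troubleshooting, I took a long detour. Through lack of experience I did not think of taking a thread dump. Had I done so, it should have been easy to see a large number of threads waiting for the database to return, and thus to pin the problem on the database.

This also shows that the final problem may be elementary while the investigation is complicated, because middleware problems involve the host, the network, the database, the application, and more. The lesson is that when an application responds slowly you should take a thread dump and observe what the threads are doing, rather than thinking of analyzing a javacore only once the server hangs, the CPU hits 100%, or an OutOfMemory error occurs.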

An Introduction to Server Hangs Caused by Application Deadlocks

August 17, 2009 hashei No comments

So all the good stuff has been hiding on Metalink

Problem Description

An inadvertent deadlock in the application code can cause a server to hang. For example, a situation in which thread1 is waiting for resource1 and is holding a lock on resource2, while thread2 needs resource2 and is holding the lock on resource1. Neither thread can progress.
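The thread1/thread2 situation described above is a cycle in the "waits for" relationship. As a small illustration (not part of the original BEA pattern), the sketch below follows the chain thread -> resource it waits for -> thread holding that resource, and reports a cycle if it comes back to a thread it has already seen. The function and argument names are made up.

```python
def find_deadlock(holder_of, waiting_for):
    """Return the threads forming a wait cycle, or [] if there is none.

    holder_of: resource -> thread currently holding its lock.
    waiting_for: thread -> resource that thread is blocked on.
    """
    for start in waiting_for:
        seen = []
        thread = start
        # Walk from a blocked thread to the holder of the resource it wants.
        while thread in waiting_for and thread not in seen:
            seen.append(thread)
            thread = holder_of.get(waiting_for[thread])
        if thread in seen:
            return seen[seen.index(thread):]
    return []


# The example from the text: each thread holds what the other one needs.
HOLDER_OF = {"resource1": "thread2", "resource2": "thread1"}
WAITING_FOR = {"thread1": "resource1", "thread2": "resource2"}
```

In a real thread dump the same information comes from the "waiting to lock" and "locked" lines of each thread.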

Problem Troubleshooting

This Application Deadlock pattern should be used only after doing all the steps in the Generic Server Hang pattern. One indicator that this is an application deadlock problem is that thread dumps will show the threads are in the application methods. Several thread dumps taken a few seconds apart will show that the threads are not progressing. Troubleshooting this problem will involve reviewing the application code. There exists a thread analyzer tool at BEA dev2dev which has proven useful in analysis of the thread dumps.


Read more…

Categories: weblogic, troubleshooting

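Approaches to Resolving Server Hangs Caused by JDBC

August 16, 2009 hashei 2 comments

This one is also reposted from BEA's official documentation. After Oracle acquired BEA the original address was moved to the Oracle website, so I am keeping a copy here as a backup.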

JDBC Causes Server Hang


Problem Description
A JDBC connection which is used by an application or by WebLogic Server itself will block one WebLogic Server execute thread for the complete duration of the calls that are made via this connection. The JVM will ensure that the CPU is given to runnable threads by its thread scheduling mechanism, while the thread that blocks on a SQL query needs to wait. However, the thread occupied by the JDBC call will be reserved and used for the application until the call returns from the SQL query.

Even a transaction timeout will not kill or timeout any action that is done by the resources that are enlisted in this transaction. The actions will run as long as they take, without interruption. A transaction timeout will set a flag on the transaction that will mark it as rollback only, so that any subsequent request to commit this transaction will fail with a TimedOutException or RollbackException. However, as mentioned above, the long running JDBC calls can lead to blocked WebLogic Server execute threads, which can finally lead to a hanging instance, if all threads are blocked and no execute thread remains available for handling incoming requests.

More recent WebLogic Server versions have a health check functionality that regularly checks if a thread does not react for a certain period of time (the default is 600 seconds). If this happens, an error message is printed to your log file similar to following:


####<Nov 6, 2004 1:42:30 PM EST> <Warning> <WebLogicServer> <mydomain> <myserver> <CoreHealthMonitor>
<kernel identity> <>
<000337> <ExecuteThread: '64' for queue: 'default' has been busy for "740" seconds working on the request "Scheduled Trigger",
which is more than the configured time (StuckThreadMaxTime) of "600" seconds.>


This does not interrupt the thread, as this is just a notification for the administrator. The only way a stuck thread becomes unstuck again is when the request it is handling finishes. In this case, you will find a message similar to following in your WebLogic Server’s log file:


####<Nov 7, 2004 4:17:34 PM EST> <Info> <WebLogicServer><mydomain> <myserver> <ExecuteThread: '66'
for queue: 'default'>
<kernel identity> <> <000339> <ExecuteThread: '66' for queue: 'default' has become "unstuck".>


The time interval for the health check functionality is configurable. Please check StuckThreadMaxTime property in the <Server> tag of your config.xml file: http://e-docs.bea.com/wls/docs81/config_xml/Server.html#StuckThreadMaxTime or the “Detecting stuck threads” section in the WebLogic Server administration console help: http://e-docs.bea.com/wls/docs81/perform/WLSTuning.html#stuckthread.
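Since a stuck thread is only reported and never interrupted, it helps to know which threads were reported stuck and have not yet logged an "unstuck" message. The sketch below scans log text for the two message shapes shown above; it is an illustration only, the sample text is abbreviated, and the line for thread '66' being stuck is made up for the example.

```python
import re

# One pattern for both the "has been busy" (000337) and "unstuck" (000339) messages.
EVENT_RE = re.compile(
    r"""ExecuteThread: '(?P<thread>\d+)' for queue: '(?P<queue>[^']+)' has (?:been busy for "(?P<busy>\d+)" seconds|become "unstuck")"""
)

SAMPLE_LOG = '''\
<000337> <ExecuteThread: '64' for queue: 'default' has been busy for "740" seconds working on the request "Scheduled Trigger",
<000337> <ExecuteThread: '66' for queue: 'default' has been busy for "620" seconds working on the request "Scheduled Trigger",
<000339> <ExecuteThread: '66' for queue: 'default' has become "unstuck".>
'''


def still_stuck(log_text):
    """Map (thread, queue) to the last reported busy seconds for threads not yet unstuck."""
    stuck = {}
    for m in EVENT_RE.finditer(log_text):
        key = (m.group("thread"), m.group("queue"))
        if m.group("busy") is not None:
            stuck[key] = int(m.group("busy"))
        else:
            stuck.pop(key, None)  # the thread finished its request
    return stuck
```

On the sample, only thread '64' of the default queue remains, with 740 busy seconds.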


Problem Troubleshooting
Different programming techniques or JDBC connection pool configurations can lead to deadlocks or long running JDBC calls that lead to hanging WebLogic Server instances. General information about how to troubleshoot and analyze a hanging WebLogic Server instance is provided in Generic Server Hang Pattern.

This pattern addresses JDBC calls causing a server hang and other well known JDBC-related causes for common problems leading to hanging WebLogic Server instance.  Other Support Patterns referenced in this pattern are at the WebLogic Server Support Patterns Site.


Why does the problem occur?
The following are some different possible reasons that can cause JDBC calls to lead to a hanging WebLogic Server instance:


Synchronized DriverManager.getConnection()
Older JDBC application code sometimes uses DriverManager.getConnection() calls to retrieve a database connection using a certain driver. This technique is not recommended as it can cause deadlocks or at least relatively low performance for your connection requests. The reason behind this is, that all DriverManager calls are class-synchronized, meaning that one DriverManager call in one thread will block all other DriverManager calls in any other thread inside one WebLogic Server instance.

In addition to that, the constructor for a SQLException makes a DriverManager call, and most drivers have DriverManager.println() calls for logging, so any of these can block all other threads that issue a DriverManager call.

DriverManager.getConnection() can take a relatively long time until it returns with the physical connection created to the database. Even if no deadlock occurs, all other calls need to wait until that one thread gets its connection. This is not a best practice in a multi-threaded system like WebLogic Server.


This information is taken from http://forums.bea.com/bea//thread.jspa?forumID=2022&threadID=200063365&messageID=202311284&start=-1#202311284.
Also our documentation clearly states that DriverManager.getConnection() should not be used: http://e-docs.bea.com/wls/docs81/faq/jdbc.html#501044.

If you prefer to use JDBC connections in your JDBC code, you should use a WebLogic Server JDBC connection pool, define a DataSource for it, and get the connection from the DataSource. This will give you all advantages from a pool (resource sharing, connection reuse, connection refresh if a database was down, etc). It also will help you avoid the deadlocks that may happen with DriverManager calls. See detailed information on how to use JDBC connection pools, DataSources, and other JDBC objects in WebLogic Server at: http://e-docs.bea.com/wls/docs81/jdbc/intro.html#1036718 and http://e-docs.bea.com/wls/docs81/jdbc/programming.html#1054307.

A typical thread blocked in a DriverManager.getConnection() call looks like:

"ExecuteThread-39" daemon prio=5 tid=0x401660 nid=0x33 waiting for monitor entry [0xd247f000..0xd247fc68]
  at java.sql.DriverManager.getConnection(DriverManager.java:188)
  at com.bla.updateDataInDatabase(MyClass.java:296)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:865)
  at weblogic.servlet.internal.ServletStubImpl.invokeServlet(ServletStubImpl.java:120)
  at weblogic.servlet.internal.ServletContextImpl.invokeServlet(ServletContextImpl.java:945)
  at weblogic.servlet.internal.ServletContextImpl.invokeServlet(ServletContextImpl.java:909)
  at weblogic.servlet.internal.ServletContextManager.invokeServlet(ServletContextManager.java:269)
  at weblogic.socket.MuxableSocketHTTP.invokeServlet(MuxableSocketHTTP.java:392)
  at weblogic.socket.MuxableSocketHTTP.execute(MuxableSocketHTTP.java:274)
  at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:130)


Long Running SQL Queries
Long running SQL queries block execute threads for their duration and until they return their result to the calling application. This means that a WebLogic Server instance needs to be configured to be able to handle enough calls simultaneously as they are requested by the application load. Limiting factors here are the number of execute threads and the number of connections in the JDBC connection pools. A general rule of thumb is to set the number of connections in the pool equally to the number of execute threads to enable optimal resource utilization. If JTS is used, some more connections in the pools should be available because connections may be reserved for transactions that are actually not active.

A thread hanging in a long running SQL call will show a very similar stack in a thread dump as the one for a hanging database. Please compare the next section for details.

Hanging Database
Good database performance is key for the performance of an application that relies on this database. Consequently, a hanging database can block many or all available execute threads in a WebLogic Server instance and finally lead to a hanging server. To diagnose this, you should take 5 to 10 thread dumps from your hanging WebLogic Server instance and check your execute threads (in the default queue or your application thread queue) to see if they are currently in SQL calls and waiting for a result from the database. A typical stack trace for a thread that currently issues a sql query could look similar to following example:


"ExecuteThread: '4' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x8e93c8 nid=0x19 runnable [e137f000..e13819bc]
  at java.net.SocketInputStream.socketRead0(Native Method)
  at java.net.SocketInputStream.read(SocketInputStream.java:129)
  at oracle.net.ns.Packet.receive(Unknown Source)
  at oracle.net.ns.DataPacket.receive(Unknown Source)
  at oracle.net.ns.NetInputStream.getNextPacket(Unknown Source)
  at oracle.net.ns.NetInputStream.read(Unknown Source)
  at oracle.net.ns.NetInputStream.read(Unknown Source)
  at oracle.net.ns.NetInputStream.read(Unknown Source)
  at oracle.jdbc.ttc7.MAREngine.unmarshalUB1(MAREngine.java:931)
  at oracle.jdbc.ttc7.MAREngine.unmarshalSB1(MAREngine.java:893)
  at oracle.jdbc.ttc7.Oall7.receive(Oall7.java:375)
  at oracle.jdbc.ttc7.TTC7Protocol.doOall7(TTC7Protocol.java:1983)
  at oracle.jdbc.ttc7.TTC7Protocol.fetch(TTC7Protocol.java:1250)
  - locked <e8c68f00> (a oracle.jdbc.ttc7.TTC7Protocol)
  at oracle.jdbc.driver.OracleStatement.doExecuteQuery(OracleStatement.java:2529)
  at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:2857)
  at oracle.jdbc.driver.OraclePreparedStatement.executeUpdate(OraclePreparedStatement.java:608)
  - locked <e5cc44d0> (a oracle.jdbc.driver.OraclePreparedStatement)
  - locked <e8c544c8> (a oracle.jdbc.driver.OracleConnection)
  at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:536)
  - locked <e5cc44d0> (a oracle.jdbc.driver.OraclePreparedStatement)
  - locked <e8c544c8> (a oracle.jdbc.driver.OracleConnection)
  at weblogic.jdbc.wrapper.PreparedStatement.executeQuery(PreparedStatement.java:80)
  at myPackage.query.getAnalysis(MyClass.java:94)
  at jsp_servlet._jsp._jspService(__jspService.java:242)
  at weblogic.servlet.jsp.JspBase.service(JspBase.java:33)
  at weblogic.servlet.internal.ServletStubImpl$ServletInvocationAction.run(ServletStubImpl.java:971)
  at weblogic.servlet.internal.ServletStubImpl.invokeServlet(ServletStubImpl.java:402)
  at weblogic.servlet.internal.ServletStubImpl.invokeServlet(ServletStubImpl.java:305)
  at weblogic.servlet.internal.RequestDispatcherImpl.include(RequestDispatcherImpl.java:607)
  at weblogic.servlet.internal.RequestDispatcherImpl.include(RequestDispatcherImpl.java:400)
  at weblogic.servlet.jsp.PageContextImpl.include(PageContextImpl.java:154)
  at jsp_servlet._jsp.__mf1924jq._jspService(__mf1924jq.java:563)
  at weblogic.servlet.jsp.JspBase.service(JspBase.java:33)
  at weblogic.servlet.internal.ServletStubImpl$ServletInvocationAction.run(ServletStubImpl.java:971)
  at weblogic.servlet.internal.ServletStubImpl.invokeServlet(ServletStubImpl.java:402)
  at weblogic.servlet.internal.ServletStubImpl.invokeServlet(ServletStubImpl.java:305)
  at weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:6350)
  at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:317)
  at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:118)
  at weblogic.servlet.internal.WebAppServletContext.invokeServlet(WebAppServletContext.java:3635)
  at weblogic.servlet.internal.ServletRequestImpl.execute(ServletRequestImpl.java:2585)
  at weblogic.kernel.ExecuteThread.execute(ExecuteThread.java:197)
  at weblogic.kernel.ExecuteThread.run(ExecuteThread.java:170)


The thread will be in running state. You should compare the threads in your different thread dumps in order to see if they receive the return from the SQL call in a timely manner or if they hang in this same call for a longer period of time. If the thread dumps seem to imply long response times from SQL calls, the corresponding database logs should be checked to see if problems in the database cause this slow performance or hang situation.
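Comparing threads across several dumps, as recommended above, is tedious by hand. The sketch below is one possible way to automate it under simple assumptions: dumps use the HotSpot-style text format shown in this article, and a thread counts as "waiting on the database" when its top frame is socketRead0 and some frame below it belongs to a JDBC package. The sample dumps are trimmed and partly made up, and all helper names are hypothetical.

```python
import re

HEADER_RE = re.compile(r'^"(?P<name>.+?)"\s')  # thread header line, name in quotes

DUMP_1 = '''
"ExecuteThread: '4' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x8e93c8 nid=0x19 runnable
  at java.net.SocketInputStream.socketRead0(Native Method)
  at oracle.jdbc.ttc7.TTC7Protocol.fetch(TTC7Protocol.java:1250)
  at weblogic.jdbc.wrapper.PreparedStatement.executeQuery(PreparedStatement.java:80)

"ExecuteThread: '5' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x8e94d0 nid=0x1a waiting
  at java.lang.Object.wait(Native Method)
  at weblogic.kernel.ExecuteThread.waitForRequest(ExecuteThread.java:154)
'''

DUMP_2 = '''
"ExecuteThread: '4' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x8e93c8 nid=0x19 runnable
  at java.net.SocketInputStream.socketRead0(Native Method)
  at oracle.jdbc.ttc7.TTC7Protocol.fetch(TTC7Protocol.java:1250)
  at weblogic.jdbc.wrapper.PreparedStatement.executeQuery(PreparedStatement.java:80)

"ExecuteThread: '5' for queue: 'weblogic.kernel.Default'" daemon prio=5 tid=0x8e94d0 nid=0x1a runnable
  at java.net.SocketInputStream.socketRead0(Native Method)
  at oracle.jdbc.ttc7.TTC7Protocol.fetch(TTC7Protocol.java:1250)
'''


def parse_threads(dump_text):
    """Map each thread name to its list of stack frames (without the 'at ' prefix)."""
    threads = {}
    current = None
    for line in dump_text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group("name")
            threads[current] = []
        elif current is not None and line.strip().startswith("at "):
            threads[current].append(line.strip()[3:])
    return threads


def waiting_on_database(frames):
    """Heuristic: blocked in a socket read that was issued from JDBC code."""
    return bool(frames) and "socketRead0" in frames[0] and any(".jdbc." in f for f in frames)


def persistent_db_waiters(dumps):
    """Threads that are waiting on the database in every one of the given dumps."""
    waiter_sets = [
        {name for name, frames in parse_threads(d).items() if waiting_on_database(frames)}
        for d in dumps
    ]
    return set.intersection(*waiter_sets) if waiter_sets else set()
```

A thread that shows up in the result for dumps taken some seconds apart is a good candidate for checking the corresponding SQL in the database logs. The heuristic cannot tell a slow query from a slow network, which matches the caveat in the text.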


Slow Network
Communication between WebLogic Server and the database relies on a well-performing and reliable network in order to serve the requests in a timely manner. Slow network performance can therefore lead to hanging or blocking execute threads waiting for results of SQL queries. The related stack traces will look similar to example above in Hanging Database section. It is not possible to find the root cause of the hanging or slow SQL queries by solely analyzing the WebLogic Server thread dumps. These give the first hint that something is wrong with the performance of the SQL calls. The next step is to check if there is a database or network problem that causes poorly performing SQL calls.

Deadlock
Both an application level deadlock as well as a deadlock on the database level can lead to hanging threads. You should check your thread dumps to see if there is an application level deadlock. Information on how to do this is provided in Server Hang – Application Deadlock Pattern. A database deadlock can be detected either in the database log or by the SQL Exception that can be found in the WebLogic Server log file. An example for a related SQL Exception is:


java.sql.SQLException: ORA-00060: deadlock detected while waiting for resource
  at oracle.jdbc.dbaccess.DBError.throwSqlException(DBError.java:170)
  at oracle.jdbc.oci8.OCIDBAccess.check_error(OCIDBAccess.java:1614)
  at oracle.jdbc.oci8.OCIDBAccess.executeFetch(OCIDBAccess.java:1225)
  at oracle.jdbc.oci8.OCIDBAccess.parseExecuteFetch(OCIDBAccess.java:1338)
  at oracle.jdbc.driver.OracleStatement.executeNonQuery(OracleStatement.java:1722)
  at oracle.jdbc.driver.OracleStatement.doExecuteOther(OracleStatement.java:1647)
  at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:2167)
  at oracle.jdbc.driver.OraclePreparedStatement.executeUpdate(OraclePreparedStatement.java:404)


As it generally can take some time until a database detects a deadlock and resolves it by rolling back one or more transactions that cause the deadlock, one or more execute threads will be blocked until the rollback has finished.

RefreshMinutes or TestFrequencySeconds
If you see recurring periods of low database performance, slow SQL calls, or connection peaks, the setting of the RefreshMinutes or TestFrequencySeconds configuration property in your JDBC connection pools could be the reason. This is described in detail in Investigating JDBC Problems Pattern. Unless you have a firewall between your WebLogic Server instance and your database, you should disable this functionality.

Pool Shrinking
Physical connections to a database are resources that should be opened once and kept open as long as possible, as a new connection request is a considerable resource overhead for the database, the operating system kernel, and the WebLogic Server. Consequently, pool shrinking should be disabled on production systems in order to keep this overhead at a minimum. If pool shrinking is enabled, idle pool connections will be closed and reopened once connection requests to the pool cannot be satisfied.

As these activities can take some time, the related application requests may take an unexpectedly long time which can lead users to assume that the system hangs. Information on how to optimize JDBC connection pool configurations is provided in Investigating JDBC Problems Pattern.


Analysis of a hanging WebLogic Server instance
General information on how to analyze a hanging WebLogic Server instance is provided in Generic Server Hang Pattern.

Most times it will be helpful to start with taking thread dumps from the hanging system in order to find out what is going on, e.g., what the different threads are doing and why they hang. Generally, thread dumps can be taken on production systems, however caution is necessary for very old versions of the JVM (<1.3.1_09), as they may crash during thread dumps. Also if the WebLogic Server instance has a huge number of threads, it will mean that the thread dump will take awhile to complete, while the rest of the threads are blocked.

Please take more than one thread dump (5 to 10) with a delay of some seconds in between. This gives you the possibility to check the progress of the different threads. Also it will show if the system actually hangs (no progress at all) or if the throughput is extremely slow, which can seem to be a hanging system.

Information on how to take thread dumps is provided in “Generic Server Hang” support pattern or in our documentation: http://e-docs.bea.com/wls/docs81/cluster/trouble.html.
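On Unix/Linux, taking a series of dumps with a fixed delay is easy to script, since `kill -3` is just the SIGQUIT signal. The helper below is a sketch, not a BEA utility: the function name and its defaults are my own, and the `send` and `sleep` parameters exist only so the loop can be exercised without a real server process.

```python
import os
import signal
import time


def take_thread_dumps(pid, count=5, delay_seconds=10, send=os.kill, sleep=time.sleep):
    """Unix only: send SIGQUIT (kill -3) to the JVM `count` times, pausing in between."""
    for i in range(count):
        send(pid, signal.SIGQUIT)  # the JVM writes a thread dump and keeps running
        if i < count - 1:
            sleep(delay_seconds)
```

For example, `take_thread_dumps(12345, count=5, delay_seconds=10)` would request five dumps ten seconds apart from process 12345; the dumps land wherever the JVM normally writes them (standard output or javacore files, depending on the JVM).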

Also please check if the complete WebLogic Server instance hangs or if it is the application that hangs. “Generic Server Hang” support pattern also includes this information.

Analyzing the thread dumps can show whether one of the causes listed in the earlier section “Why does the problem occur?” is actually responsible for your hanging instance. For example, if all your threads are in a DriverManager method such as getConnection(), you have identified the root cause and need to change your application to use a DataSource or Driver.connect() instead of DriverManager.getConnection().
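As a minimal sketch of the DataSource alternative, the code below obtains connections from a container-managed DataSource instead of calling DriverManager.getConnection(). The class name PooledQuery, the JNDI name passed to lookup(), and the SQL passed to countRows() are placeholders for illustration. The try-with-resources syntax assumes Java 7 or later; on older JVMs the same pattern needs explicit finally blocks that close the result set, statement, and connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.sql.DataSource;

/** Hypothetical helper showing DataSource-based access. */
final class PooledQuery {

    /**
     * Looks up a DataSource bound in JNDI by the server.
     * The name (for example "jdbc/exampleDS") is an assumption and must
     * match whatever name the data source is configured with.
     */
    static DataSource lookup(String jndiName) throws NamingException {
        return (DataSource) new InitialContext().lookup(jndiName);
    }

    /**
     * Borrows a connection, runs a single-value query such as
     * SELECT COUNT(*), and returns the first column of the first row.
     * Closing the connection hands it back to the pool when the
     * DataSource is pool-backed.
     */
    static int countRows(DataSource ds, String sql) throws SQLException {
        try (Connection con = ds.getConnection();
             PreparedStatement ps = con.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getInt(1) : 0;
        }
    }
}
```

The important part is that every code path closes the connection, so it is returned promptly and other threads are not left waiting for one.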

A very useful tool, Samurai, can be used to analyze thread dumps and to monitor the progress of threads between different thread dumps. It can be downloaded from dev2dev at: http://dev2dev.bea.com/resourcelibrary/utilitiestools/adminmgmt.jsp.

For a deeper look at thread dumps and what they reveal about a server hang, see the whitepaper on analyzing thread dumps on dev2dev: http://dev2dev.bea.com/products/wlplatform81/articles/thread_dumps.jsp

Tips and Tricks to optimize your JDBC code and JDBC connection pool configuration
Best practices in both JDBC code development and JDBC connection pool configuration help to avoid common problems and optimize resource usage, which in turn helps prevent server instances from hanging.

JDBC Programming
To optimize resource usage in WebLogic Server and conserve database resources, use JDBC connection pools for your application’s JDBC calls. Connections created and destroyed in your application code generate unnecessary overhead and should be avoided. For generic documentation on JDBC programming, see: http://e-docs.bea.com/wls/docs81/jdbc/rmidriver.html#1028977. Details on JDBC performance tuning are at: http://e-docs.bea.com/wls/docs81/jdbc/performance.html#1027791.

Comprehensive information on JDBC that helps you optimize your JDBC code and the use of your JDBC resources is available on the dev2dev Java Database Connectivity page at: http://dev2dev.bea.com/technologies/jdbc/index.jsp.

JDBC Connection Pool Configuration
The Investigating JDBC Problems Pattern has recommendations on how to configure a connection pool for production environments. In order to avoid hangs or bad performance, these configuration tips should be considered.

Known Issues
You can periodically review the Release Notes for your version of WLS for more information on Known Issues or Resolved Issues in Service Packs and browse for JDBC server hang-related issues.  For your convenience, see the following:

Please note that changes have been made in WLS 8.1 SP3 to resolve CR134921, where for certain JDBC connections, the call to roll back a transaction was not being handled immediately because the driver had to wait for any currently-executing statement to return. 

Searching will also return Release Notes, as well as other Support Solutions and CR-related information as noted at Need Further Help?. Contract customers who are logged in at http://support.bea.com/ will also see a Browse portlet for both Solutions and Bug Central where the latest available CRs can be browsed by Product version.


Need Further Help?
If you have followed the pattern, but still require additional help, you can:
  1. Query AskBEA at http://support.bea.com/ using “jdbc server hang”, as an example, to discover other published solutions. Contract Support Customers: Ensure you are logged in to access available CR-related information.
  2. Ask a more detailed question on one of BEA’s newsgroups at http://forums.bea.com

If this does not resolve your issue and you have a valid Support Contract, you can open a Support Case by logging in at: http://support.bea.com/


FEEDBACK

Please provide us input on whether or not this Support Diagnostic Pattern “JDBC Causes Server Hang” helped, any clarifications you needed, and any requests for new topics to Support Diagnostic Patterns.



DISCLAIMER NOTICE:

BEA Systems, Inc. provides the technical tips and patches on this Website for your use under the terms of BEA’s maintenance and support agreement with you. While you may use this information and code in connection with software you have licensed from BEA, BEA makes no warranty of any kind, express or implied, regarding the technical tips and patches.

Any trademarks referenced in this document are the property of their respective owners. Consult your product manuals for complete trademark information.

Categories: Performance Optimization, Troubleshooting