k****t posts: 184 | 1 Interviewer: If a production application hangs, how do you find out what caused
the problem?
Me: I would dump the heap to ... (Interviewer interrupts: "Suppose it's production, you can't
stop it.")
Me: I would check CPU; if the CPU is busy ... (Interviewer interrupts: "Suppose the CPU is not busy.")
Me: I would check memory usage; if ... (Interviewer interrupts: "Suppose memory is not used up.")
Me: I would check whether there is blocking I/O ... (Interviewer interrupts: "Suppose it's not because
of I/O blocking.")
Me: That's my way of analyzing an issue: I rule things out to ...
(Interviewer interrupts: "OK, let's suppose it is an I/O issue, but it's a million-line
application, and thousands of places could involve I/O. How do you solve
the problem?")
Me: For an application this large, troubleshooting needs a deep understanding of
the code.
(Interviewer interrupts: "Suppose you know the code very well, and suppose you wrote the
code yourself.")
I went silent, and so did the interviewer. (I was thinking: he must want one definite answer, a single
sentence that hits the nail on the head, but I had no such answer ... and I sank into long thought.)
A minute later
Interviewer: It's OK, let's move on to the next question.
I still can't think of the answer. Could the experts here advise?
Thanks! |
w**a posts: 487 | 2 Newbie here: does this application have logs? Can you look at the logs while the program is still running?
[In reply to k****t]
|
k****t posts: 184 | 3 I didn't think of that, so I didn't ask at the time.
[In reply to w**a]
|
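h*******9 posts: 46 | 4 Check logs... If possible, you can also check the database. If it is a service, you can also check
the monitoring. A typical service should have logs and monitoring services. Honestly though, if it is
an application, nothing you said was wrong, because an application generally has no
"cannot be paused" constraint. The interviewer, intentionally or not, called it an application. |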
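g***s posts: 3811 | 5 Logs are of course the first choice, but in most cases they probably won't reveal why it hangs;
a thread dump is the first thing to consider. I guess that's the answer he wanted: kill -3 $pid |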
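For a JVM process, `kill -3 $pid` sends SIGQUIT, and HotSpot responds by printing a thread dump to its stdout while it keeps running. Much the same information can be collected from inside the process through the standard `ThreadMXBean` API. The following is only a minimal sketch under that assumption; the class and method names (`ThreadDumpSketch`, `dumpAllThreads`) are made up for illustration and do not come from the thread.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative sketch: an in-process thread dump, similar in spirit to what
// `kill -3 <pid>` makes a HotSpot JVM print. Names here are hypothetical.
final class ThreadDumpSketch {

    // Returns one text block per live thread: name, state, then stack frames.
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // (false, false): skip locked-monitor and synchronizer details
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

In a hang, the interesting part is usually many threads sitting in the same frame (a lock, a socket read, a pool `take()`); taking two or three dumps a few seconds apart helps tell "slow" from "stuck".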
g*****g posts: 34805 | 6 I would set up metrics covering the frequent API calls, for both volume and
latency. I would have the metrics logged to a separate server and displayed
on a timeline chart, with alerts to warn me
if the volume or latency crosses a certain threshold compared to history. I
would even set up a circuit breaker if self-recovery is possible. It
should then be pretty easy to narrow down which call is causing trouble. There
are open-source tools for all of these.
The key is to prepare for such accidents, not react to them. If there is a number you
want to know when it hangs, you should build it in beforehand.
[In reply to k****t]
|
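The "alert when latency is over a threshold compared to history" idea from post 6 can be sketched in a few lines. This is an assumption-laden toy, not what any particular monitoring product does: `LatencyMonitor`, its sliding window, and the "factor times historical mean" rule are all invented here for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: flag a call whose latency is far above its recent history.
// The class name and the alert rule are hypothetical.
final class LatencyMonitor {
    private final int window;     // number of past samples kept as "history"
    private final double factor;  // alert when sample > factor * historical mean
    private final Deque<Long> history = new ArrayDeque<>();
    private long sum = 0;

    LatencyMonitor(int window, double factor) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        this.window = window;
        this.factor = factor;
    }

    // Records one latency sample in milliseconds.
    // Returns true if the sample should raise an alert.
    synchronized boolean record(long latencyMillis) {
        boolean alert = false;
        // Only compare once a full window of history exists.
        if (history.size() == window) {
            double mean = (double) sum / window;
            alert = latencyMillis > factor * mean;
        }
        // Note: the sample joins the history even if it alerted, so a long
        // incident gradually raises the baseline. A real system may not want that.
        history.addLast(latencyMillis);
        sum += latencyMillis;
        if (history.size() > window) {
            sum -= history.removeFirst();
        }
        return alert;
    }
}
```

One monitor per API call name, fed from wherever the call is timed, is enough to answer "which call started misbehaving first" after the fact.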
f******o posts: 102 | 7 Since he said "suppose you wrote the code and know the code very well", start from
the code. Find out when the problem first appeared, then identify the culprit change list and revert that
change list or revert the deployment. |
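c***d posts: 996 | 8 It has already been stated that this is app server I/O...
This is really more of an SRE question. Even if you know nothing about the code, given an I/O-busy production
server you should be able to find the offending function within 30 minutes.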
[In reply to g*****g]
|
w***x posts: 105 | |
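g*****g posts: 34805 | 10 Once it has really hung, there isn't much you can do besides kill -3 to see what the threads are all doing. If all the threads have been eaten up, it's
no surprise that there are hardly any logs.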
[In reply to c***d]
|
p*****y posts: 529 | 11 This is not a purely technical question. By stretching you in a "rude" way, he
was trying to find out how you perform under pressure and whether you can stay
calm and keep the conversation going even when the other party does not, which
is very typical in a real production support scenario. At
least, this is how I typically use this kind of question.
[In reply to k****t]
|
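J****n posts: 937 | 12 Questions like this are pointless. There are many ways to solve the problem, and the choice depends on the actual situation. The interviewer has
one answer in mind, or only knows one answer, and if you don't pick his answer you are wrong. This is a really stupid way to
interview; in the vast majority of cases it shows that the interviewer doesn't understand his own question at all, or knows only one answer.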
[In reply to k****t]
|
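a**n posts: 313 | 13 If the CPU is not maxed out, or not even a single CPU is at 100%, the interviewer probably wanted you to say: use jstack or kill -3,
VisualVM (jvisualvm), or another profiling tool to attach to the process and check for a deadlock or the like.
The interviewer was probably a customer support person. This is an experience question, so he was still giving you a hard time.
However, a jmap heap dump does not terminate the process (it only pauses the JVM while the dump is written), so the interviewer doesn't fully understand it himself either.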
[In reply to g*****g]
|
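The deadlock check that jstack performs at the end of its output is also available programmatically through `ThreadMXBean.findDeadlockedThreads()`. Below is a minimal sketch, assuming a HotSpot-style JVM that supports this monitoring; `DeadlockCheck` and its helper methods are hypothetical names, and the two "demo" daemon threads exist only to give the check something to find.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

// Illustrative sketch: report threads that are deadlocked on monitors or locks.
final class DeadlockCheck {

    // One description per deadlocked thread; empty list if none are detected.
    static List<String> findDeadlocks() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is found
        List<String> result = new ArrayList<>();
        if (ids == null) {
            return result;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // thread terminated in the meantime
            }
            result.add(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
        }
        return result;
    }

    // Polls findDeadlocks() until something shows up or the timeout expires.
    static List<String> pollForDeadlocks(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        List<String> found = findDeadlocks();
        while (found.isEmpty() && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            found = findDeadlocks();
        }
        return found;
    }

    // Demo helper: two daemon threads that take locks a and b in opposite order.
    static void startDeadlockedDaemonThreads() {
        Object a = new Object();
        Object b = new Object();
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        startDaemon("demo-1", a, b, bothHoldFirstLock);
        startDaemon("demo-2", b, a, bothHoldFirstLock);
    }

    private static void startDaemon(String name, Object first, Object second,
                                    CountDownLatch latch) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                latch.countDown();
                try {
                    latch.await(); // wait until the other thread holds its first lock
                } catch (InterruptedException e) {
                    return;
                }
                synchronized (second) {
                    // never reached: the other thread holds `second`
                }
            }
        }, name);
        t.setDaemon(true); // daemon, so the stuck threads do not keep the JVM alive
        t.start();
    }

    public static void main(String[] args) {
        startDeadlockedDaemonThreads();
        pollForDeadlocks(5000).forEach(System.out::println);
    }
}
```

Note that this only finds true lock cycles. A hang caused by threads all waiting on an exhausted pool or a remote call that never returns will not show up here and still needs the plain thread dump.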
b********n posts: 5997 | 14 You should say, 'suppose you shut yr f**k up, everything will be fine.'
[In reply to k****t]
|
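b******l posts: 860 | 15 This is the right answer. Anyone who sulks and flies into a rage over this should go reflect on it. If you ever hit an outage in production
with the director/VP going frantic on the call, you'll know how commonplace dealing with this kind of questioning is.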
[In reply to p*****y]
|
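w***x posts: 105 | 16 For a programmer, the most conventional answer is to attach gdb, get a core dump out, and study it at leisure...
I feel that anyone asking this kind of silly question either has never written a program or is deliberately picking a fight; if neither, he's just nuts. |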
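s*******e posts: 1630 | 17 He already said to assume it's your own code, so talk about how you write instrumentation to help with live-site debugging.
If he says "suppose you have no logging", say that production code you write always has logging; otherwise it isn't acceptable
production code. |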
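One concrete form of "instrumentation for live-site debugging" is a registry of operations that have started but not yet finished, so that during a hang you can ask the process what it is stuck on without attaching anything. This is only a sketch of that idea; `InFlightTracker`, `track`, and `stuckOperations` are hypothetical names, not an existing library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Illustrative sketch: remember every operation that is currently running.
final class InFlightTracker {

    private static final class Op {
        final String name;
        final long startMillis;

        Op(String name, long startMillis) {
            this.name = name;
            this.startMillis = startMillis;
        }
    }

    private final Map<Long, Op> inFlight = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    // Runs `body`, keeping it registered as in-flight for its whole duration.
    <T> T track(String name, Supplier<T> body) {
        long id = nextId.incrementAndGet();
        inFlight.put(id, new Op(name, System.currentTimeMillis()));
        try {
            return body.get();
        } finally {
            inFlight.remove(id); // removed on normal return and on exception
        }
    }

    // Names of operations that started at least minAgeMillis ago and are still running.
    List<String> stuckOperations(long minAgeMillis) {
        long now = System.currentTimeMillis();
        List<String> stuck = new ArrayList<>();
        for (Op op : inFlight.values()) {
            if (now - op.startMillis >= minAgeMillis) {
                stuck.add(op.name);
            }
        }
        return stuck;
    }
}
```

Wrapping each I/O entry point as `tracker.track("orders.db.query", () -> dao.load(id))` (names invented) and exposing `stuckOperations(30_000)` through an admin endpoint or a periodic log line would turn "thousands of places involve I/O" into a short list.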
b*********r posts: 651 | 18 You should have asked him to state all of his "suppose"s at once... |
v*****1 posts: 2200 | 19 I don't understand this at all, but he was definitely out to sink you; don't have any illusions.
[In reply to k****t]
|
w********s posts: 1570 | 20 ptrace, strace
Check status, locks, and context switches in procfs.
You seem to lack hands-on practice. Did you only grind practice problems?
[In reply to k****t]
|
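On Linux, the procfs check from post 20 boils down to reading `/proc/<pid>/status` (or `/proc/<pid>/task/<tid>/status` per thread) and looking at fields such as `State`, `voluntary_ctxt_switches`, and `nonvoluntary_ctxt_switches`. A rough sketch of doing that from Java follows; `ProcStatus` is a made-up name, and the interpretation in the comments is a rule of thumb, not a guarantee.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: read the key/value fields of /proc/<pid>/status (Linux only).
final class ProcStatus {

    // Parses "Key:<whitespace>value" lines into an ordered map.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon), line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        String pid = args.length > 0 ? args[0] : "self";
        Map<String, String> f =
                parse(Files.readAllLines(Paths.get("/proc", pid, "status")));
        // "D (disk sleep)" usually means blocked in uninterruptible I/O;
        // "S (sleeping)" means waiting on something (a lock, a socket, a timer).
        System.out.println("State: " + f.get("State"));
        // Sample these twice: counters that do not move suggest the task is not running at all.
        System.out.println("voluntary_ctxt_switches: " + f.get("voluntary_ctxt_switches"));
        System.out.println("nonvoluntary_ctxt_switches: " + f.get("nonvoluntary_ctxt_switches"));
    }
}
```

Reading procfs needs no debugger and does not stop the target, which is why it fits the "you can't stop production" constraint better than gdb or ptrace-based tools.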
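w********s posts: 1570 | 21 He would tell you that you can't casually kill things in prod; logs are the first thing you can look at.
If you could kill it, why not attach gdb?
What he implied is that there is no gdb in prod.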
[In reply to g***s]
|
w********s posts: 1570 | 22 He was talking about prod, not a QA environment where you can mess around freely.
[In reply to g*****g]
|
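w********s posts: 1570 | 23 This is simply a technical question, meant to separate those who only grind problems from those with experience.
In fact, this question reveals a lot about your level.
If you are really good, you could build something gdb-like from procfs and ptrace.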
[In reply to p*****y]
|
g*****g posts: 34805 | 24 What I described is of course what you do in prod.
[In reply to w********s]
|
g*****g posts: 34805 | 25 Don't tell me you think kill -3 kills the process?
[In reply to w********s]
|
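d****n posts: 1637 | 26 I think the original poster's answers were already quite professional. The interviewer may have wanted some half-baked tricks.
First look at disk I/O
iostat?
Then look at the database
Then look at the network
netstat?
Determine which one is the problem; if it is a design problem, go back to kill -3, core dump -> gdb.
That said, if production has none of the methods the original poster listed, it's a really crappy prod, one that exists
just so people can clean up after it.
When going to production you had better have a system monitoring service (appdynamics.com) or the like.
If not, then just "suppose" there's no this and no that, haha |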
c***z posts: 6348 | 27 Exactly, the right answer is "it depends".
Also, it is very rude to interrupt people; you were probably being set up to fail.
[In reply to J****n]
|
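k**0 posts: 19737 | 28 Program log + email notification. Also set up server notifications in case
the program hangs.
You don't need to get into technical detail to deal with an interviewer who is all talk.
The biggest problem with technical staff is being too technical; to move up you must learn to tailor what you say to your audience.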
[In reply to k****t]
|
a****l posts: 8211 | 29 The second paragraph of this is the right answer.
[In reply to g*****g]
|
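j******o posts: 4219 | 30 For this kind of question every system and program has a different answer, and talking about exactly what to do is nonsense. How do you know your
system is sure to have kill -3?
Exactly what to do is already decided at the design stage; logs and a daemon are the more common approach. |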
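The "log plus daemon" design from post 30 is often realized as a heartbeat: the worker records a timestamp whenever it makes progress, and a watchdog (a daemon thread or a separate process) raises an alarm when the timestamp goes stale. The sketch below shows only the core check, with the clock passed in so it is easy to reason about; `Heartbeat`, `beat`, and `isStalled` are names invented for this example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch: detect a hang by noticing that heartbeats have stopped.
final class Heartbeat {
    private final long timeoutMillis;
    private final AtomicLong lastBeatMillis;

    Heartbeat(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastBeatMillis = new AtomicLong(nowMillis);
    }

    // Called by the worker each time it completes a unit of work.
    void beat(long nowMillis) {
        lastBeatMillis.set(nowMillis);
    }

    // Called by the watchdog; true when no beat arrived within the timeout.
    boolean isStalled(long nowMillis) {
        return nowMillis - lastBeatMillis.get() > timeoutMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        Heartbeat hb = new Heartbeat(200, System.currentTimeMillis());
        Thread watchdog = new Thread(() -> {
            while (!hb.isStalled(System.currentTimeMillis())) {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    return;
                }
            }
            // A real watchdog would log a thread dump or send a notification here.
            System.out.println("worker appears hung: no heartbeat within timeout");
        }, "watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
        Thread.sleep(500); // simulate a worker that never beats again
    }
}
```

Having the watchdog trigger a thread dump at the moment it detects the stall captures the evidence automatically, which addresses the worry that a given system might not allow kill -3 by hand.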
l*********u posts: 19053 | 31 If you know the code well, you should know what things the app does. Check them in order and you'll quickly find where it hangs.
[In reply to k****t]
|
b*******e posts: 4483 | 32 This is exactly what he was asking about, and you didn't answer it right.
[In reply to k****t]
|
i****k posts: 668 | 33 But how do you know it's definitely Java... What if a previous maintainer quietly handled that signal?
[In reply to g*****g]
|
f*******s posts: 182 | 34 Dump the stack trace. You might have an infinite loop in the code. |