Fault analysis on large, complex communication systems using analytics techniques
Abstract
Computing and storage systems that contain many servers and network devices,
such as cloud computing systems, software-defined networks, and content delivery
networks, continue to grow in number and size. Managing these systems is a challenge
because of their scale, heterogeneity, and importance, while administrators' expertise
and supporting tools are limited. Among the management functions, fault management
is particularly difficult, and faults can seriously interrupt systems and services.
There is therefore a constant demand for techniques and tools that assist
administrators in managing faults.
Several research activities have focused on event monitoring and on fault
detection, analysis, and resolution. Existing techniques usually analyze log events,
messages, and traces, and rely on administrators' expertise to detect and resolve
faults, so they depend heavily on human operators. This paper studies analytics
techniques, especially Random Forest, for fault analysis. Applying analytics techniques
to fault analysis provides administrators with significant facts for dealing with
faults and thus reduces the dependence on human expertise.
Analytics techniques form a broad area of computer science with many
algorithms, of which Random Forest is only one. Random Forest, which is considered
“young” compared with other algorithms such as neural networks, Bayesian methods,
and K-means, has many strong features but also some weaknesses. My research focuses
on analyzing faults, particularly bug reports, where no one has previously applied
Random Forest. Because of limited hardware and a noisy data set, my study could not
achieve the best
outcome. Therefore, I need to improve the efficiency and the performance in the future