Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix sign fields in ftrace_define_fields_##call() tracing/syscalls: Fix typo in SYSCALL_DEFINE0 tracing/kprobe: Show sign of fields in trace_kprobe format files ksym_tracer: Remove trace_stat ksym_tracer: Fix race when incrementing count ksym_tracer: Fix to allow writing newline to ksym_trace_filter ksym_tracer: Fix to make the tracer work tracing: Kconfig spelling fixes and cleanups tracing: Fix setting tracer specific options Documentation: Update ftrace-design.txt Documentation: Update tracepoint-analysis.txt Documentation: Update mmiotrace.txt
This commit is contained in:
@@ -53,14 +53,14 @@ size of the mcount call that is embedded in the function).
|
||||
For example, if the function foo() calls bar(), when the bar() function calls
|
||||
mcount(), the arguments mcount() will pass to the tracer are:
|
||||
"frompc" - the address bar() will use to return to foo()
|
||||
"selfpc" - the address bar() (with _mcount() size adjustment)
|
||||
"selfpc" - the address bar() (with mcount() size adjustment)
|
||||
|
||||
Also keep in mind that this mcount function will be called *a lot*, so
|
||||
optimizing for the default case of no tracer will help the smooth running of
|
||||
your system when tracing is disabled. So the start of the mcount function is
|
||||
typically the bare min with checking things before returning. That also means
|
||||
the code flow should usually kept linear (i.e. no branching in the nop case).
|
||||
This is of course an optimization and not a hard requirement.
|
||||
typically the bare minimum with checking things before returning. That also
|
||||
means the code flow should usually be kept linear (i.e. no branching in the nop
|
||||
case). This is of course an optimization and not a hard requirement.
|
||||
|
||||
Here is some pseudo code that should help (these functions should actually be
|
||||
implemented in assembly):
|
||||
@@ -131,10 +131,10 @@ some functions to save (hijack) and restore the return address.
|
||||
|
||||
The mcount function should check the function pointers ftrace_graph_return
|
||||
(compare to ftrace_stub) and ftrace_graph_entry (compare to
|
||||
ftrace_graph_entry_stub). If either of those are not set to the relevant stub
|
||||
ftrace_graph_entry_stub). If either of those is not set to the relevant stub
|
||||
function, call the arch-specific function ftrace_graph_caller which in turn
|
||||
calls the arch-specific function prepare_ftrace_return. Neither of these
|
||||
function names are strictly required, but you should use them anyways to stay
|
||||
function names is strictly required, but you should use them anyway to stay
|
||||
consistent across the architecture ports -- easier to compare & contrast
|
||||
things.
|
||||
|
||||
@@ -144,7 +144,7 @@ but the first argument should be a pointer to the "frompc". Typically this is
|
||||
located on the stack. This allows the function to hijack the return address
|
||||
temporarily to have it point to the arch-specific function return_to_handler.
|
||||
That function will simply call the common ftrace_return_to_handler function and
|
||||
that will return the original return address with which, you can return to the
|
||||
that will return the original return address with which you can return to the
|
||||
original call site.
|
||||
|
||||
Here is the updated mcount pseudo code:
|
||||
|
@@ -44,7 +44,8 @@ Check for lost events.
|
||||
Usage
|
||||
-----
|
||||
|
||||
Make sure debugfs is mounted to /sys/kernel/debug. If not, (requires root privileges)
|
||||
Make sure debugfs is mounted to /sys/kernel/debug.
|
||||
If not (requires root privileges):
|
||||
$ mount -t debugfs debugfs /sys/kernel/debug
|
||||
|
||||
Check that the driver you are about to trace is not loaded.
|
||||
@@ -91,7 +92,7 @@ $ dmesg > dmesg.txt
|
||||
$ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt
|
||||
and then send the .tar.gz file. The trace compresses considerably. Replace
|
||||
"pciid" and "nick" with the PCI ID or model name of your piece of hardware
|
||||
under investigation and your nick name.
|
||||
under investigation and your nickname.
|
||||
|
||||
|
||||
How Mmiotrace Works
|
||||
@@ -100,7 +101,7 @@ How Mmiotrace Works
|
||||
Access to hardware IO-memory is gained by mapping addresses from PCI bus by
|
||||
calling one of the ioremap_*() functions. Mmiotrace is hooked into the
|
||||
__ioremap() function and gets called whenever a mapping is created. Mapping is
|
||||
an event that is recorded into the trace log. Note, that ISA range mappings
|
||||
an event that is recorded into the trace log. Note that ISA range mappings
|
||||
are not caught, since the mapping always exists and is returned directly.
|
||||
|
||||
MMIO accesses are recorded via page faults. Just before __ioremap() returns,
|
||||
@@ -122,11 +123,11 @@ Trace Log Format
|
||||
----------------
|
||||
|
||||
The raw log is text and easily filtered with e.g. grep and awk. One record is
|
||||
one line in the log. A record starts with a keyword, followed by keyword
|
||||
dependant arguments. Arguments are separated by a space, or continue until the
|
||||
one line in the log. A record starts with a keyword, followed by keyword-
|
||||
dependent arguments. Arguments are separated by a space, or continue until the
|
||||
end of line. The format for version 20070824 is as follows:
|
||||
|
||||
Explanation Keyword Space separated arguments
|
||||
Explanation Keyword Space-separated arguments
|
||||
---------------------------------------------------------------------------
|
||||
|
||||
read event R width, timestamp, map id, physical, value, PC, PID
|
||||
@@ -136,7 +137,7 @@ iounmap event UNMAP timestamp, map id, PC, PID
|
||||
marker MARK timestamp, text
|
||||
version VERSION the string "20070824"
|
||||
info for reader LSPCI one line from lspci -v
|
||||
PCI address map PCIDEV space separated /proc/bus/pci/devices data
|
||||
PCI address map PCIDEV space-separated /proc/bus/pci/devices data
|
||||
unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID
|
||||
|
||||
Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual
|
||||
|
@@ -10,8 +10,8 @@ Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
|
||||
creating custom kernel modules to register probe functions using the event
|
||||
tracing infrastructure.
|
||||
|
||||
Simplistically, tracepoints will represent an important event that when can
|
||||
be taken in conjunction with other tracepoints to build a "Big Picture" of
|
||||
Simplistically, tracepoints represent important events that can be
|
||||
taken in conjunction with other tracepoints to build a "Big Picture" of
|
||||
what is going on within the system. There are a large number of methods for
|
||||
gathering and interpreting these events. Lacking any current Best Practises,
|
||||
this document describes some of the methods that can be used.
|
||||
@@ -33,12 +33,12 @@ calling
|
||||
|
||||
will give a fair indication of the number of events available.
|
||||
|
||||
2.2 PCL
|
||||
2.2 PCL (Performance Counters for Linux)
|
||||
-------
|
||||
|
||||
Discovery and enumeration of all counters and events, including tracepoints
|
||||
Discovery and enumeration of all counters and events, including tracepoints,
|
||||
are available with the perf tool. Getting a list of available events is a
|
||||
simple case of
|
||||
simple case of:
|
||||
|
||||
$ perf list 2>&1 | grep Tracepoint
|
||||
ext4:ext4_free_inode [Tracepoint event]
|
||||
@@ -49,19 +49,19 @@ simple case of
|
||||
[ .... remaining output snipped .... ]
|
||||
|
||||
|
||||
2. Enabling Events
|
||||
3. Enabling Events
|
||||
==================
|
||||
|
||||
2.1 System-Wide Event Enabling
|
||||
3.1 System-Wide Event Enabling
|
||||
------------------------------
|
||||
|
||||
See Documentation/trace/events.txt for a proper description on how events
|
||||
can be enabled system-wide. A short example of enabling all events related
|
||||
to page allocation would look something like
|
||||
to page allocation would look something like:
|
||||
|
||||
$ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
|
||||
|
||||
2.2 System-Wide Event Enabling with SystemTap
|
||||
3.2 System-Wide Event Enabling with SystemTap
|
||||
---------------------------------------------
|
||||
|
||||
In SystemTap, tracepoints are accessible using the kernel.trace() function
|
||||
@@ -86,7 +86,7 @@ were allocating the pages.
|
||||
print_count()
|
||||
}
|
||||
|
||||
2.3 System-Wide Event Enabling with PCL
|
||||
3.3 System-Wide Event Enabling with PCL
|
||||
---------------------------------------
|
||||
|
||||
By specifying the -a switch and analysing sleep, the system-wide events
|
||||
@@ -107,16 +107,16 @@ for a duration of time can be examined.
|
||||
Similarly, one could execute a shell and exit it as desired to get a report
|
||||
at that point.
|
||||
|
||||
2.4 Local Event Enabling
|
||||
3.4 Local Event Enabling
|
||||
------------------------
|
||||
|
||||
Documentation/trace/ftrace.txt describes how to enable events on a per-thread
|
||||
basis using set_ftrace_pid.
|
||||
|
||||
2.5 Local Event Enablement with PCL
|
||||
3.5 Local Event Enablement with PCL
|
||||
-----------------------------------
|
||||
|
||||
Events can be activate and tracked for the duration of a process on a local
|
||||
Events can be activated and tracked for the duration of a process on a local
|
||||
basis using PCL such as follows.
|
||||
|
||||
$ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
|
||||
@@ -131,18 +131,18 @@ basis using PCL such as follows.
|
||||
|
||||
0.973913387 seconds time elapsed
|
||||
|
||||
3. Event Filtering
|
||||
4. Event Filtering
|
||||
==================
|
||||
|
||||
Documentation/trace/ftrace.txt covers in-depth how to filter events in
|
||||
ftrace. Obviously using grep and awk of trace_pipe is an option as well
|
||||
as any script reading trace_pipe.
|
||||
|
||||
4. Analysing Event Variances with PCL
|
||||
5. Analysing Event Variances with PCL
|
||||
=====================================
|
||||
|
||||
Any workload can exhibit variances between runs and it can be important
|
||||
to know what the standard deviation in. By and large, this is left to the
|
||||
to know what the standard deviation is. By and large, this is left to the
|
||||
performance analyst to do it by hand. In the event that the discrete event
|
||||
occurrences are useful to the performance analyst, then perf can be used.
|
||||
|
||||
@@ -166,7 +166,7 @@ In the event that some higher-level event is required that depends on some
|
||||
aggregation of discrete events, then a script would need to be developed.
|
||||
|
||||
Using --repeat, it is also possible to view how events are fluctuating over
|
||||
time on a system wide basis using -a and sleep.
|
||||
time on a system-wide basis using -a and sleep.
|
||||
|
||||
$ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
|
||||
-e kmem:mm_pagevec_free \
|
||||
@@ -180,7 +180,7 @@ time on a system wide basis using -a and sleep.
|
||||
|
||||
1.002251757 seconds time elapsed ( +- 0.005% )
|
||||
|
||||
5. Higher-Level Analysis with Helper Scripts
|
||||
6. Higher-Level Analysis with Helper Scripts
|
||||
============================================
|
||||
|
||||
When events are enabled the events that are triggering can be read from
|
||||
@@ -190,11 +190,11 @@ be gathered on-line as appropriate. Examples of post-processing might include
|
||||
|
||||
o Reading information from /proc for the PID that triggered the event
|
||||
o Deriving a higher-level event from a series of lower-level events.
|
||||
o Calculate latencies between two events
|
||||
o Calculating latencies between two events
|
||||
|
||||
Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
|
||||
script that can read trace_pipe from STDIN or a copy of a trace. When used
|
||||
on-line, it can be interrupted once to generate a report without existing
|
||||
on-line, it can be interrupted once to generate a report without exiting
|
||||
and twice to exit.
|
||||
|
||||
Simplistically, the script just reads STDIN and counts up events but it
|
||||
@@ -212,12 +212,12 @@ also can do more such as
|
||||
processes, the parent process responsible for creating all the helpers
|
||||
can be identified
|
||||
|
||||
6. Lower-Level Analysis with PCL
|
||||
7. Lower-Level Analysis with PCL
|
||||
================================
|
||||
|
||||
There may also be a requirement to identify what functions with a program
|
||||
There may also be a requirement to identify what functions within a program
|
||||
were generating events within the kernel. To begin this sort of analysis, the
|
||||
data must be recorded. At the time of writing, this required root
|
||||
data must be recorded. At the time of writing, this required root:
|
||||
|
||||
$ perf record -c 1 \
|
||||
-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
|
||||
@@ -253,11 +253,11 @@ perf report.
|
||||
# (For more details, try: perf report --sort comm,dso,symbol)
|
||||
#
|
||||
|
||||
According to this, the vast majority of events occured triggered on events
|
||||
within the VDSO. With simple binaries, this will often be the case so lets
|
||||
According to this, the vast majority of events triggered on events
|
||||
within the VDSO. With simple binaries, this will often be the case so let's
|
||||
take a slightly different example. In the course of writing this, it was
|
||||
noticed that X was generating an insane amount of page allocations so lets look
|
||||
at it
|
||||
noticed that X was generating an insane amount of page allocations so let's look
|
||||
at it:
|
||||
|
||||
$ perf record -c 1 -f \
|
||||
-e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
|
||||
@@ -280,8 +280,8 @@ This was interrupted after a few seconds and
|
||||
# (For more details, try: perf report --sort comm,dso,symbol)
|
||||
#
|
||||
|
||||
So, almost half of the events are occuring in a library. To get an idea which
|
||||
symbol.
|
||||
So, almost half of the events are occurring in a library. To get an idea which
|
||||
symbol:
|
||||
|
||||
$ perf report --sort comm,dso,symbol
|
||||
# Samples: 27666
|
||||
@@ -297,7 +297,7 @@ symbol.
|
||||
0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
|
||||
0.00% Xorg [kernel] [k] ftrace_trace_userstack
|
||||
|
||||
To see where within the function pixmanFillsse2 things are going wrong
|
||||
To see where within the function pixmanFillsse2 things are going wrong:
|
||||
|
||||
$ perf annotate pixmanFillsse2
|
||||
[ ... ]
|
||||
|
Reference in New Issue
Block a user