Commit Graph

744 Commits

Author SHA1 Message Date
Suzuki Poulose
0f890c8d20 powerpc: Rename mapping based RELOCATABLE to DYNAMIC_MEMSTART for BookE
The current implementation of CONFIG_RELOCATABLE in BookE is based
on mapping the page aligned kernel load address to KERNELBASE. This
approach however is not enough for platforms, where the TLB page size
is large (e.g, 256M on 44x). So we are renaming the RELOCATABLE used
currently in BookE to DYNAMIC_MEMSTART to reflect the actual method.

The CONFIG_RELOCATABLE for PPC32(BookE) based on processing of the
dynamic relocations will be introduced in the later in the patch series.

This change would allow the use of the old method of RELOCATABLE for
platforms which can afford to enforce the page alignment (platforms with
smaller TLB size).

Changes since v3:

* Introduced a new config, NONSTATIC_KERNEL, to denote a kernel which is
  either a RELOCATABLE or DYNAMIC_MEMSTART(Suggested by: Josh Boyer)

Suggested-by: Scott Wood <scottwood@freescale.com>
Tested-by: Scott Wood <scottwood@freescale.com>

Signed-off-by: Suzuki K. Poulose <suzuki@in.ibm.com>
Cc: Scott Wood <scottwood@freescale.com>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linux ppc dev <linuxppc-dev@lists.ozlabs.org>
Signed-off-by: Josh Boyer <jwboyer@gmail.com>
2011-12-20 10:20:19 -05:00
David Rientjes
2011b1d0d3 powerpc/mm: Fix section mismatch for read_n_cells
read_n_cells() cannot be marked as .devinit.text since it is referenced
from two functions that are not in that section: of_get_lmb_size() and
hot_add_drconf_scn_to_nid().

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-19 14:41:18 +11:00
David Rientjes
28e86bdbc9 powerpc/mm: Fix section mismatch for mark_reserved_regions_for_nid
mark_reserved_regions_for_nid() is only called from do_init_bootmem(),
which is in .init.text, so it must be in the same section to avoid a
section mismatch warning.

Reported-by: Subrata Modak <subrata@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-19 14:41:16 +11:00
Benjamin Herrenschmidt
1e7342e778 Merge remote-tracking branch 'jwb/next' into next
Conflicts:
	arch/powerpc/platforms/40x/ppc40x_simple.c
2011-12-16 11:24:25 +11:00
Christoph Egger
ca899859f1 powerpc/44x: Removing dead CONFIG_PPC47x
CONFIG_PPC47x doesn't exist in Kconfig and no 476 processor calls this
function ppc44x_pin_tlb() as it has it's own ppc47x_pin_tlb().

This code is probably an artifact of the original 476 code that
shouldn't have made it upstream.

Signed-off-by: Christoph Egger <siccegge@cs.fau.de>
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Josh Boyer <jwboyer@gmail.com>
2011-12-09 07:48:56 -05:00
sukadev@linux.vnet.ibm.com
8a3e3d31d1 powerpc: Punch a hole in /dev/mem for librtas
With CONFIG_STRICT_DEVMEM=y, user space cannot read any part of /dev/mem.
Since this breaks librtas, punch a hole in /dev/mem to allow access to the
rmo_buffer that librtas needs.

Anton Blanchard reported the problem and helped with the fix.

A quick test for this patch:

       # cat /proc/rtas/rmo_buffer
       000000000f190000 10000

       # python -c "print 0x000000000f190000 / 0x10000"
       3865

       # dd if=/dev/mem of=/tmp/foo count=1 bs=64k skip=3865
       1+0 records in
       1+0 records out
       65536 bytes (66 kB) copied, 0.000205235 s, 319 MB/s

       # dd if=/dev/mem of=/tmp/foo
       dd: reading `/dev/mem': Operation not permitted
       0+0 records in
       0+0 records out
       0 bytes (0 B) copied, 0.00022519 s, 0.0 kB/s

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-08 14:22:52 +11:00
Becky Bruce
d93e4d7d72 powerpc/book3e: Change hugetlb preload to take vma argument
This avoids an extra find_vma() and is less error-prone.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07 16:26:24 +11:00
Becky Bruce
a6146888be powerpc: Add gpages reservation code for 64-bit FSL BOOKE
For 64-bit FSL_BOOKE implementations, gigantic pages need to be
reserved at boot time by the memblock code based on the command line.
This adds the call that handles the reservation, and fixes some code
comments.

It also removes the previous pr_err when reserve_hugetlb_gpages
is called on a system without hugetlb enabled - the way the code is
structured, the call is unconditional and the resulting error message
spurious and confusing.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07 16:26:23 +11:00
Becky Bruce
d1b9b12811 powerpc: Add hugepage support to 64-bit tablewalk code for FSL_BOOK3E
Before hugetlb, at each level of the table, we test for
!0 to determine if we have a valid table entry.  With hugetlb, this
compare becomes:
        < 0 is a normal entry
        0 is an invalid entry
        > 0 is huge

This works because the hugepage code pulls the top bit off the entry
(which for non-huge entries always has the top bit set) as an
indicator that we have a hugepage.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07 16:26:22 +11:00
Becky Bruce
27609a42ee powerpc: Whitespace/comment changes to tlb_low_64e.S
I happened to comment this code while I was digging through it;
we might as well commit that.  I also made some whitespace
changes - the existing code had a lot of unnecessary newlines
that I found annoying when I was working on my tiny laptop.

No functional changes.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07 16:26:22 +11:00
Becky Bruce
881fde1db5 powerpc: hugetlb: modify include usage for FSL BookE code
The original 32-bit hugetlb implementation used PPC64 vs PPC32 to
determine which code path to take.  However, the final hugetlb
implementation for 64-bit FSL ended up shared with the FSL
32-bit code so the actual check needs to be FSL_BOOK3E vs
everything else.  This patch changes the include protections to
reflect this.

There are also a couple of related comment fixes.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07 16:26:22 +11:00
Becky Bruce
a1cd541988 powerpc: Update hugetlb huge_pte_alloc and tablewalk code for FSL BOOKE
This updates the hugetlb page table code to handle 64-bit FSL_BOOKE.
The previous 32-bit work counted on the inner levels of the page table
collapsing.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07 16:26:22 +11:00
Becky Bruce
8c1674de2b powerpc: Fix booke hugetlb preload code for PPC_MM_SLICES and 64-bit
This patch does 2 things: It corrects the code that determines the
size to write into MAS1 for the PPC_MM_SLICES case (this originally
came from David Gibson and I had incorrectly altered it), and it
changes the methodolody used to calculate the size for !PPC_MM_SLICES
to work for 64-bit as well as 32-bit.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07 16:26:21 +11:00
Becky Bruce
7651295944 powerpc: Only define HAVE_ARCH_HUGETLB_UNMAPPED_AREA if PPC_MM_SLICES
If we don't have slices, we should be able to use the generic
hugetlb_get_unmapped_area() code

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-12-07 16:26:21 +11:00
Dan McGee
fa8cbaaf5a powerpc+sparc64/mm: Remove hack in mmap randomize layout
Since commit 8a0a9bd4db, this comment in mmap_rnd() does not
hold true as the value returned by get_random_int() will in fact be

different every single call. Remove the comment and simplify the code
back to its original desired form.

This reverts commit a5adc91a4b which is no longer necessary and
also fixes the sparc code that copied this same adjustment.

Signed-off-by: Dan McGee <dpmcgee@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-28 11:42:09 +11:00
sukadev@linux.vnet.ibm.com
1d54cf2b97 powerpc: Implement CONFIG_STRICT_DEVMEM
As described in the help text in the patch, this token restricts general
access to /dev/mem as a way of increasing the security. Specifically, access
to exclusive IOMEM and kernel RAM is denied unless CONFIG_STRICT_DEVMEM is
set to 'n'.

Implement the 'devmem_is_allowed()' interface for Powerpc. It will be
called from range_is_allowed() when userpsace attempts to access /dev/mem.

This patch is based on an earlier patch from Steve Best and with input from
Paul Mackerras and Scott Wood.

[BenH] Fixed a typo or two and removed the generic change which should
       be submitted as a separate patch

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-28 11:42:08 +11:00
Jimi Xenidis
c3dcf53a3f powerpc/icswx: Simple ACOP fault handler
This patch adds a fault handler that responds to illegal Coprocessor
types.  Currently all CTs are treated and illegal.  There are two ways
to report the fault back to the application.  If the application used
the record form ("icswx.") then the architected "reject" is emulated.
If the application did not used the record form ("icswx") then it is
selectable by config whether the failure is silent (as architected) or
a SIGILL is generated.

In all cases pr_warn() is used to log the bad CT.

Signed-off-by: Jimi Xenidis <jimix@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-25 14:11:28 +11:00
Jimi Xenidis
9d67028090 powerpc: Split ICSWX ACOP and PID processing
Some processors, like embedded, that already have a PID register that
is managed by the system.  This patch separates the ACOP and PID
processing into separate files so that the ACOP code can be shared.

Signed-off-by: Jimi Xenidis <jimix@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-25 14:11:27 +11:00
Dipankar Sarma
1c8ee73395 powerpc/numa: NUMA topology support for PowerNV
This patch adds support for numa topology on powernv platforms running
OPAL formware. It checks for the type of platform at run time and
sets the affinity form correctly so that NUMA topology can be discovered
correctly.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-08 14:51:46 +11:00
Anton Blanchard
c40dd2f766 powerpc: Add System RAM to /proc/iomem
We've resisted adding System RAM to /proc/iomem because it is
the wrong place for it. Unfortunately we continue to find tools
that rely on this behaviour so give up and add it in.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-08 14:51:46 +11:00
Linus Torvalds
32aaeffbd4 Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
  Revert "tracing: Include module.h in define_trace.h"
  irq: don't put module.h into irq.h for tracking irqgen modules.
  bluetooth: macroize two small inlines to avoid module.h
  ip_vs.h: fix implicit use of module_get/module_put from module.h
  nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
  include: replace linux/module.h with "struct module" wherever possible
  include: convert various register fcns to macros to avoid include chaining
  crypto.h: remove unused crypto_tfm_alg_modname() inline
  uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
  pm_runtime.h: explicitly requires notifier.h
  linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
  miscdevice.h: fix up implicit use of lists and types
  stop_machine.h: fix implicit use of smp.h for smp_processor_id
  of: fix implicit use of errno.h in include/linux/of.h
  of_platform.h: delete needless include <linux/module.h>
  acpi: remove module.h include from platform/aclinux.h
  miscdevice.h: delete unnecessary inclusion of module.h
  device_cgroup.h: delete needless include <linux/module.h>
  net: sch_generic remove redundant use of <linux/module.h>
  net: inet_timewait_sock doesnt need <linux/module.h>
  ...

Fix up trivial conflicts (other header files, and  removal of the ab3550 mfd driver) in
 - drivers/media/dvb/frontends/dibx000_common.c
 - drivers/media/video/{mt9m111.c,ov6650.c}
 - drivers/mfd/ab3550-core.c
 - include/linux/dmaengine.h
2011-11-06 19:44:47 -08:00
Linus Torvalds
1197ab2942 Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits)
  powerpc/p3060qds: Add support for P3060QDS board
  powerpc/83xx: Add shutdown request support to MCU handling on MPC8349 MITX
  powerpc/85xx: Make kexec to interate over online cpus
  powerpc/fsl_booke: Fix comment in head_fsl_booke.S
  powerpc/85xx: issue 15 EOI after core reset for FSL CoreNet devices
  powerpc/8xxx: Fix interrupt handling in MPC8xxx GPIO driver
  powerpc/85xx: Add 'fsl,pq3-gpio' compatiable for GPIO driver
  powerpc/86xx: Correct Gianfar support for GE boards
  powerpc/cpm: Clear muram before it is in use.
  drivers/virt: add ioctl for 32-bit compat on 64-bit to fsl-hv-manager
  powerpc/fsl_msi: add support for "msi-address-64" property
  powerpc/85xx: Setup secondary cores PIR with hard SMP id
  powerpc/fsl-booke: Fix settlbcam for 64-bit
  powerpc/85xx: Adding DCSR node to dtsi device trees
  powerpc/85xx: clean up FPGA device tree nodes for Freecsale QorIQ boards
  powerpc/85xx: fix PHYS_64BIT selection for P1022DS
  powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map
  powerpc: respect mem= setting for early memory limit setup
  powerpc: Update corenet64_smp_defconfig
  powerpc: Update mpc85xx/corenet 32-bit defconfigs
  ...

Fix up trivial conflicts in:
 - arch/powerpc/configs/40x/hcu4_defconfig
	removed stale file, edited elsewhere
 - arch/powerpc/include/asm/udbg.h, arch/powerpc/kernel/udbg.c:
	added opal and gelic drivers vs added ePAPR driver
 - drivers/tty/serial/8250.c
	moved UPIO_TSI to powerpc vs removed UPIO_DWAPB support
2011-11-06 17:12:03 -08:00
Andrea Arcangeli
b35a35b556 thp: share get_huge_page_tail()
This avoids duplicating the function in every arch gup_fast.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:58 -07:00
Andrea Arcangeli
cf592bf768 powerpc: gup_huge_pmd() return 0 if pte changes
powerpc didn't return 0 in that case, if it's rolling back the *nr pointer
it should also return zero to avoid adding pages to the array at the wrong
offset.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:57 -07:00
Andrea Arcangeli
3526741f09 powerpc: gup_hugepte() support THP based tail recounting
Up to this point the code assumed old refcounting for hugepages (pre-thp).
This updates the code directly to the thp mapcount tail page refcounting.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:57 -07:00
Andrea Arcangeli
8596468487 powerpc: gup_hugepte() avoid freeing the head page too many times
We only taken "refs" pins on the head page not "*nr" pins.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:57 -07:00
Andrea Arcangeli
405e44f2e3 powerpc: get_hugepte() don't put_page() the wrong page
"page" may have changed to point to the next hugepage after the loop
completed, The references have been taken on the head page, so the
put_page must happen there too.

This is a longstanding issue pre-thp inclusion.

It's totally unclear how these page_cache_add_speculative and
pte_val(pte) != pte_val(*ptep) checks are necessary across all the
powerpc gup_fast code, when x86 doesn't need any of that: there's no way
the page can be freed with irq disabled so we're guaranteed the
atomic_inc will happen on a page with page_count > 0 (so not needing the
speculative check).

The pte check is also meaningless on x86: no need to rollback on x86 if
the pte changed, because the pte can still change a CPU tick after the
check succeeded and it won't be rolled back in that case.  The important
thing is we got a reference on a valid page that was mapped there a CPU
tick ago.  So not knowing the soft tlb refill code of ppc64 in great
detail I'm not removing the "speculative" page_count increase and the
pte checks across all the code, but unless there's a strong reason for
it they should be later cleaned up too.

If a pte can change from huge to non-huge (like it could happen with
THP) passing a pte_t *ptep to gup_hugepte() would also require to repeat
the is_hugepd in gup_hugepte(), but that shouldn't happen with hugetlbfs
only so I'm not altering that.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:57 -07:00
Andrea Arcangeli
2839bdc1bf powerpc: remove superfluous PageTail checks on the pte gup_fast
This part of gup_fast doesn't seem capable of handling hugetlbfs ptes,
those should be handled by gup_hugepd only, so these checks are
superfluous.

Plus if this wasn't a noop, it would have oopsed because, the insistence
of using the speculative refcounting would trigger a VM_BUG_ON if a tail
page was encountered in the page_cache_get_speculative().

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:57 -07:00
Andrea Arcangeli
70b50f94f1 mm: thp: tail page refcounting fix
Michel while working on the working set estimation code, noticed that
calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
wasn't safe, if the pfn ended up being a tail page of a transparent
hugepage under splitting by __split_huge_page_refcount().

He then found the problem could also theoretically materialize with
page_cache_get_speculative() during the speculative radix tree lookups
that uses get_page_unless_zero() in SMP if the radix tree page is freed
and reallocated and get_user_pages is called on it before
page_cache_get_speculative has a chance to call get_page_unless_zero().

So the best way to fix the problem is to keep page_tail->_count zero at
all times.  This will guarantee that get_page_unless_zero() can never
succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
is unused for all tail pages of a compound page, so we can simply
account the tail page references there and transfer them to
tail_page->_count in __split_huge_page_refcount() (in addition to the
head_page->_mapcount).

While debugging this s/_count/_mapcount/ change I also noticed get_page is
called by direct-io.c on pages returned by get_user_pages.  That wasn't
entirely safe because the two atomic_inc in get_page weren't atomic.  As
opposed to other get_user_page users like secondary-MMU page fault to
establish the shadow pagetables would never call any superflous get_page
after get_user_page returns.  It's safer to make get_page universally safe
for tail pages and to use get_page_foll() within follow_page (inside
get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
pages without taking any locks because it is run within PT lock protected
critical sections (PT lock for pte and page_table_lock for
pmd_trans_huge).

The standard get_page() as invoked by direct-io instead will now take
the compound_lock but still only for tail pages.  The direct-io paths
are usually I/O bound and the compound_lock is per THP so very
finegrined, so there's no risk of scalability issues with it.  A simple
direct-io benchmarks with all lockdep prove locking and spinlock
debugging infrastructure enabled shows identical performance and no
overhead.  So it's worth it.  Ideally direct-io should stop calling
get_page() on pages returned by get_user_pages().  The spinlock in
get_page() is already optimized away for no-THP builds but doing
get_page() on tail pages returned by GUP is generally a rare operation
and usually only run in I/O paths.

This new refcounting on page_tail->_mapcount in addition to avoiding new
RCU critical sections will also allow the working set estimation code to
work without any further complexity associated to the tail page
refcounting with THP.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02 16:06:57 -07:00
Paul Gortmaker
4b16f8e2d6 powerpc: various straight conversions from module.h --> export.h
All these files were including module.h just for the basic
EXPORT_SYMBOL infrastructure.  We can shift them off to the
export.h header which is a way smaller footprint and thus
realize some compile time gains.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:30:44 -04:00
Paul Gortmaker
9308794884 powerpc: include export.h for files using EXPORT_SYMBOL/THIS_MODULE
Fix failures in powerpc associated with the previously allowed
implicit module.h presence that now lead to things like this:

arch/powerpc/mm/mmu_context_hash32.c:76:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
arch/powerpc/mm/tlb_hash32.c:48:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
arch/powerpc/kernel/pci_32.c:51:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
arch/powerpc/kernel/iomap.c:36:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
arch/powerpc/platforms/44x/canyonlands.c:126:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
arch/powerpc/kvm/44x.c:168:59: error: 'THIS_MODULE' undeclared (first use in this function)

[with several contibutions from Stephen Rothwell <sfr@canb.auug.org.au>]

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:30:38 -04:00
Paul Gortmaker
66b15db69c powerpc: add export.h to files making use of EXPORT_SYMBOL
With module.h being implicitly everywhere via device.h, the absence
of explicitly including something for EXPORT_SYMBOL went unnoticed.
Since we are heading to fix things up and clean module.h from the
device.h file, we need to explicitly include these files now.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-31 19:30:37 -04:00
Becky Bruce
4559424a0c powerpc/fsl-booke: Fix settlbcam for 64-bit
Currently, it does a cntlzd on the size and then subtracts it from
21.... this doesn't take into account the varying size of a "long".
Just use __ilog instead (and subtract the 10 we have to subtract
to get to the tsize encoding).

Also correct the comment about page sizes supported.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2011-10-12 23:39:10 -05:00
Kumar Gala
1dc91c3eb3 powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map
On FSL Book-E devices we support multiple large TLB sizes and so we can
get into situations in which the initial 1G TLB size is too big and
we're asked for a size that is not mappable by a single entry (like
512M).  The single entry is important because when we bring up secondary
cores they need to ensure any data structure they need to access (eg
PACA or stack) is always mapped.

So we really need to determine what size will actually be mapped by the
first TLB entry to ensure we limit early memory references to that
region.  We refactor the map_mem_in_cams() code to provider a helper
function that we can utilize to determine the size of the first TLB
entry while taking into account size and alignment constraints.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2011-10-11 23:30:41 -05:00
Paul Mackerras
25c29f9e32 powerpc: Fix hugetlb with CONFIG_PPC_MM_SLICES=y
Commit 41151e77a4 ("powerpc: Hugetlb for BookE") added some
#ifdef CONFIG_MM_SLICES conditionals to hugetlb_get_unmapped_area()
and vma_mmu_pagesize().  Unfortunately this is not the correct config
symbol; it should be CONFIG_PPC_MM_SLICES.  The result is that
attempting to use hugetlbfs on 64-bit Power server processors results
in an infinite stack recursion between get_unmapped_area() and
hugetlb_get_unmapped_area().

This fixes it by changing the #ifdef to use CONFIG_PPC_MM_SLICES
in those functions and also in book3e_hugetlb_preload().

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-23 10:21:33 +10:00
Anton Blanchard
8bdafa39a4 powerpc: Fix deadlock in icswx code
The icswx code introduced an A-B B-A deadlock:

     CPU0                    CPU1
     ----                    ----
lock(&anon_vma->mutex);
                             lock(&mm->mmap_sem);
                             lock(&anon_vma->mutex);
lock(&mm->mmap_sem);

Instead of using the mmap_sem to keep mm_users constant, take the
page table spinlock.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20 15:53:23 +10:00
Anton Blanchard
a11940978b powerpc: Fix oops when echoing bad values to /sys/devices/system/memory/probe
If we echo an address the hypervisor doesn't like to
/sys/devices/system/memory/probe we oops the box:

# echo 0x10000000000 > /sys/devices/system/memory/probe

kernel BUG at arch/powerpc/mm/hash_utils_64.c:541!

The backtrace is:

create_section_mapping
arch_add_memory
add_memory
memory_probe_store
sysdev_class_store
sysfs_write_file
vfs_write
SyS_write

In create_section_mapping we BUG if htab_bolt_mapping returned
an error. A better approach is to return an error which will
propagate back to userspace.

Rerunning the test with this patch applied:

# echo 0x10000000000 > /sys/devices/system/memory/probe
-bash: echo: write error: Invalid argument

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20 15:53:23 +10:00
Anton Blanchard
dfbe93a222 powerpc: Coding style cleanups
While converting code to use for_each_node_by_type I noticed a
number of coding style issues.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20 15:53:23 +10:00
Anton Blanchard
94db7c5e14 powerpc: Use for_each_node_by_type instead of open coding it
Use for_each_node_by_type instead of open coding it.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20 15:53:23 +10:00
Anton Blanchard
6083184269 powerpc/numa: Remove double of_node_put in hot_add_node_scn_to_nid
During memory hotplug testing, I got the following warning:

ERROR: Bad of_node_put() on /memory@0

of_node_release
kref_put
of_node_put
of_find_node_by_type
hot_add_node_scn_to_nid
hot_add_scn_to_nid
memory_add_physaddr_to_nid
...

of_find_node_by_type() loop does the of_node_put for us so we only
need the handle the case where we terminate the loop early.

As suggested by Stephen Rothwell we can do the of_node_put
unconditionally outside of the loop since of_node_put handles a
NULL argument fine.

Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20 15:53:22 +10:00
Becky Bruce
41151e77a4 powerpc: Hugetlb for BookE
Enable hugepages on Freescale BookE processors.  This allows the kernel to
use huge TLB entries to map pages, which can greatly reduce the number of
TLB misses and the amount of TLB thrashing experienced by applications with
large memory footprints.  Care should be taken when using this on FSL
processors, as the number of large TLB entries supported by the core is low
(16-64) on current processors.

The supported set of hugepage sizes include 4m, 16m, 64m, 256m, and 1g.
Page sizes larger than the max zone size are called "gigantic" pages and
must be allocated on the command line (and cannot be deallocated).

This is currently only fully implemented for Freescale 32-bit BookE
processors, but there is some infrastructure in the code for
64-bit BooKE.

Signed-off-by: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-09-20 09:19:40 +10:00
Linus Torvalds
184475029a Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (99 commits)
  drivers/virt: add missing linux/interrupt.h to fsl_hypervisor.c
  powerpc/85xx: fix mpic configuration in CAMP mode
  powerpc: Copy back TIF flags on return from softirq stack
  powerpc/64: Make server perfmon only built on ppc64 server devices
  powerpc/pseries: Fix hvc_vio.c build due to recent changes
  powerpc: Exporting boot_cpuid_phys
  powerpc: Add CFAR to oops output
  hvc_console: Add kdb support
  powerpc/pseries: Fix hvterm_raw_get_chars to accept < 16 chars, fixing xmon
  powerpc/irq: Quieten irq mapping printks
  powerpc: Enable lockup and hung task detectors in pseries and ppc64 defeconfigs
  powerpc: Add mpt2sas driver to pseries and ppc64 defconfig
  powerpc: Disable IRQs off tracer in ppc64 defconfig
  powerpc: Sync pseries and ppc64 defconfigs
  powerpc/pseries/hvconsole: Fix dropped console output
  hvc_console: Improve tty/console put_chars handling
  powerpc/kdump: Fix timeout in crash_kexec_wait_realmode
  powerpc/mm: Fix output of total_ram.
  powerpc/cpufreq: Add cpufreq driver for Momentum Maple boards
  powerpc: Correct annotations of pmu registration functions
  ...

Fix up trivial Kconfig/Makefile conflicts in arch/powerpc, drivers, and
drivers/cpufreq
2011-07-25 22:59:39 -07:00
Linus Torvalds
5fabc487c9 Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
  KVM: IOMMU: Disable device assignment without interrupt remapping
  KVM: MMU: trace mmio page fault
  KVM: MMU: mmio page fault support
  KVM: MMU: reorganize struct kvm_shadow_walk_iterator
  KVM: MMU: lockless walking shadow page table
  KVM: MMU: do not need atomicly to set/clear spte
  KVM: MMU: introduce the rules to modify shadow page table
  KVM: MMU: abstract some functions to handle fault pfn
  KVM: MMU: filter out the mmio pfn from the fault pfn
  KVM: MMU: remove bypass_guest_pf
  KVM: MMU: split kvm_mmu_free_page
  KVM: MMU: count used shadow pages on prepareing path
  KVM: MMU: rename 'pt_write' to 'emulate'
  KVM: MMU: cleanup for FNAME(fetch)
  KVM: MMU: optimize to handle dirty bit
  KVM: MMU: cache mmio info on page fault path
  KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code
  KVM: MMU: do not update slot bitmap if spte is nonpresent
  KVM: MMU: fix walking shadow page table
  KVM guest: KVM Steal time registration
  ...
2011-07-24 09:07:03 -07:00
Linus Torvalds
4d4abdcb1d Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (123 commits)
  perf: Remove the nmi parameter from the oprofile_perf backend
  x86, perf: Make copy_from_user_nmi() a library function
  perf: Remove perf_event_attr::type check
  x86, perf: P4 PMU - Fix typos in comments and style cleanup
  perf tools: Make test use the preset debugfs path
  perf tools: Add automated tests for events parsing
  perf tools: De-opt the parse_events function
  perf script: Fix display of IP address for non-callchain path
  perf tools: Fix endian conversion reading event attr from file header
  perf tools: Add missing 'node' alias to the hw_cache[] array
  perf probe: Support adding probes on offline kernel modules
  perf probe: Add probed module in front of function
  perf probe: Introduce debuginfo to encapsulate dwarf information
  perf-probe: Move dwarf library routines to dwarf-aux.{c, h}
  perf probe: Remove redundant dwarf functions
  perf probe: Move strtailcmp to string.c
  perf probe: Rename DIE_FIND_CB_FOUND to DIE_FIND_CB_END
  tracing/kprobe: Update symbol reference when loading module
  tracing/kprobes: Support module init function probing
  kprobes: Return -ENOENT if probe point doesn't exist
  ...
2011-07-22 16:44:39 -07:00
Benjamin Herrenschmidt
4b575f3e8a Merge remote-tracking branch 'jwb/next' into next 2011-07-22 13:16:41 +10:00
Tony Breeds
f7ba2991e9 powerpc/mm: Fix output of total_ram.
On 32bit platforms that support >= 4GB memory total_ram was truncated.
This creates a confusing printk:
	Top of RAM: 0x100000000, Total RAM: 0x0
Fix that:
	Top of RAM: 0x100000000, Total RAM: 0x100000000

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Acked-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-07-19 15:13:04 +10:00
Dave Kleikamp
9661534d6a powerpc/47x: allow kernel to be loaded in higher physical memory
The 44x code (which is shared by 47x) assumes the available physical memory
begins at 0x00000000.  This is not necessarily the case in an AMP
environment.

Support CONFIG_RELOCATABLE for 476 in order to allow the kernel to be
loaded into a higher memory range.

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
2011-07-12 10:34:24 -04:00
Dave Kleikamp
91b191c71e powerpc/44x: don't use tlbivax on AMP systems
Since other OS's may be running on the other cores don't use tlbivax

Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Josh Boyer <jwboyer@linux.vnet.ibm.com>
2011-07-12 09:21:55 -04:00
Paul Mackerras
9e368f2915 KVM: PPC: book3s_hv: Add support for PPC970-family processors
This adds support for running KVM guests in supervisor mode on those
PPC970 processors that have a usable hypervisor mode.  Unfortunately,
Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
1), but the YDL PowerStation does have a usable hypervisor mode.

There are several differences between the PPC970 and POWER7 in how
guests are managed.  These differences are accommodated using the
CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
bits.  Notably, on PPC970:

* The LPCR, LPID or RMOR registers don't exist, and the functions of
  those registers are provided by bits in HID4 and one bit in HID0.

* External interrupts can be directed to the hypervisor, but unlike
  POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
  SRR0/1 not HSRR0/1.

* There is no virtual RMA (VRMA) mode; the guest must use an RMO
  (real mode offset) area.

* The TLB entries are not tagged with the LPID, so it is necessary to
  flush the whole TLB on partition switch.  Furthermore, when switching
  partitions we have to ensure that no other CPU is executing the tlbie
  or tlbsync instructions in either the old or the new partition,
  otherwise undefined behaviour can occur.

* The PMU has 8 counters (PMC registers) rather than 6.

* The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.

* The SLB has 64 entries rather than 32.

* There is no mediated external interrupt facility, so if we switch to
  a guest that has a virtual external interrupt pending but the guest
  has MSR[EE] = 0, we have to arrange to have an interrupt pending for
  it so that we can get control back once it re-enables interrupts.  We
  do that by sending ourselves an IPI with smp_send_reschedule after
  hard-disabling interrupts.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-12 13:16:59 +03:00
Paul Mackerras
969391c58a powerpc, KVM: Split HVMODE_206 cpu feature bit into separate HV and architecture bits
This replaces the single CPU_FTR_HVMODE_206 bit with two bits, one to
indicate that we have a usable hypervisor mode, and another to indicate
that the processor conforms to PowerISA version 2.06.  We also add
another bit to indicate that the processor conforms to ISA version 2.01
and set that for PPC970 and derivatives.

Some PPC970 chips (specifically those in Apple machines) have a
hypervisor mode in that MSR[HV] is always 1, but the hypervisor mode
is not useful in the sense that there is no way to run any code in
supervisor mode (HV=0 PR=0).  On these processors, the LPES0 and LPES1
bits in HID4 are always 0, and we use that as a way of detecting that
hypervisor mode is not useful.

Where we have a feature section in assembly code around code that
only applies on POWER7 in hypervisor mode, we use a construct like

END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)

The definition of END_FTR_SECTION_IFSET is such that the code will
be enabled (not overwritten with nops) only if all bits in the
provided mask are set.

Note that the CPU feature check in __tlbie() only needs to check the
ARCH_206 bit, not the HVMODE bit, because __tlbie() can only get called
if we are running bare-metal, i.e. in hypervisor mode.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
2011-07-12 13:16:58 +03:00