To avoid a race between rmap walk and mremap, mremap does
take_rmap_locks(). The lock was taken to ensure that rmap walk don't miss
a page table entry due to PTE moves via move_pagetables(). The kernel
does further optimization of this lock such that if we are going to find
the newly added vma after the old vma, the rmap lock is not taken. This
is because rmap walk would find the vmas in the same order and if we don't
find the page table attached to older vma we would find it with the new
vma which we would iterate later.
As explained in commit eb66ae0308 ("mremap: properly flush TLB before
releasing the page") mremap is special in that it doesn't take ownership
of the page. The optimized version for PUD/PMD aligned mremap also
doesn't hold the ptl lock. This can result in stale TLB entries as show
below.
This patch updates the rmap locking requirement in mremap to handle the race condition
explained below with optimized mremap::
Optmized PMD move
CPU 1 CPU 2 CPU 3
mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one
mmap_write_lock_killable()
addr = old_addr
lock(pte_ptl)
lock(pmd_ptl)
pmd = *old_pmd
pmd_clear(old_pmd)
flush_tlb_range(old_addr)
*new_pmd = pmd
*new_addr = 10; and fills
TLB with new addr
and old pfn
unlock(pmd_ptl)
ptep_clear_flush()
old pfn is free.
Stale TLB entry
Optimized PUD move also suffers from a similar race. Both the above race
condition can be fixed if we force mremap path to take rmap lock.
Link: https://lkml.kernel.org/r/20210616045239.370802-7-aneesh.kumar@linux.ibm.com
Fixes: 2c91bd4a4e ("mm: speed up mremap by 20x on large regions")
Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")
Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@mail.gmail.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 97113eb39fa7972722ff490b947d8af023e1f6a2)
[Kalesh Singh: Resolve some trivial conflicts in mm/mremap.c]
Bug: 151772539
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I5b7235e982ea2efdc155018271fbaf2711fac4c1
HAVE_MOVE_PUD enables remapping pages at the PUD level if both the
source and destination addresses are PUD-aligned.
With HAVE_MOVE_PUD enabled it can be inferred that there is
approximately a 13x improvement in performance on x86. (See data
below).
------- Test Results ---------
The following results were obtained using a 5.4 kernel, by remapping
a PUD-aligned, 1GB sized region to a PUD-aligned destination.
The results from 10 iterations of the test are given below:
Total mremap times for 1GB data on x86. All times are in nanoseconds.
Control HAVE_MOVE_PUD
180394 15089
235728 14056
238931 25741
187330 13838
241742 14187
177925 14778
182758 14728
160872 14418
205813 15107
245722 13998
205721.5 15594 <-- Mean time in nanoseconds
A 1GB mremap completion time drops from ~205 microseconds
to ~15 microseconds on x86. (~13x speed up).
Link: https://lkml.kernel.org/r/20201014005320.2233162-6-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Hassan Naveed <hnaveed@wavecomp.com>
Cc: Jia He <justin.he@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit be37c98d1134a8e068b52618c086dab6b34b9a2c)
Bug: 151772539
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I7967951289885157ef3a487a6935abe3b860847f
HAVE_MOVE_PUD enables remapping pages at the PUD level if both the source
and destination addresses are PUD-aligned.
With HAVE_MOVE_PUD enabled it can be inferred that there is approximately
a 19x improvement in performance on arm64. (See data below).
------- Test Results ---------
The following results were obtained using a 5.4 kernel, by remapping a
PUD-aligned, 1GB sized region to a PUD-aligned destination. The results
from 10 iterations of the test are given below:
Total mremap times for 1GB data on arm64. All times are in nanoseconds.
Control HAVE_MOVE_PUD
1247761 74271
1219896 46771
1094792 59687
1227760 48385
1043698 76666
1101771 50365
1159896 52500
1143594 75261
1025833 61354
1078125 48697
1134312.6 59395.7 <-- Mean time in nanoseconds
A 1GB mremap completion time drops from ~1.1 milliseconds to ~59
microseconds on arm64. (~19x speed up).
Link: https://lkml.kernel.org/r/20201014005320.2233162-5-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Hassan Naveed <hnaveed@wavecomp.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f5308c896d5de211245a9dc73b4e530f75185dd5)
Bug: 151772539
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: I30590b7375dde84fe345f4920a06f6a2c0b5aa31
Android needs to move large memory regions for garbage collection. The GC
requires moving physical pages of multi-gigabyte heap using mremap.
During this move, the application threads have to be paused for
correctness. It is critical to keep this pause as short as possible to
avoid jitters during user interaction.
Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD level if
the source and destination addresses are PUD-aligned. For
CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves PGD
entries, since the PUD entry is “folded back” onto the PGD entry. Add
HAVE_MOVE_PUD so that architectures where moving at the PUD level isn't
supported/tested can turn this off by not selecting the config.
Link: https://lkml.kernel.org/r/20201014005320.2233162-4-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Hassan Naveed <hnaveed@wavecomp.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Krzysztof Kozlowski <krzk@kernel.org>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Minchan Kim <minchan@google.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Sami Tolvanen <samitolvanen@google.com>
Cc: Sandipan Das <sandipan@linux.ibm.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c49dd340180260c6239e453263a9a244da9a7c85)
[Kalesh Singh: Resolve conflicts in mm/mremap.c]
Bug: 151772539
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: Ia9b065f5059044815fd05f22bad33c484b2b2b73
HAVE_MOVE_PMD enables remapping pages at the PMD level if both the
source and destination addresses are PMD-aligned.
HAVE_MOVE_PMD is already enabled on x86. The original patch [1] that
introduced this config did not enable it on arm64 at the time because
of performance issues with flushing the TLB on every PMD move. These
issues have since been addressed in more recent releases with
improvements to the arm64 TLB invalidation and core mmu_gather code as
Will Deacon mentioned in [2].
>From the data below, it can be inferred that there is approximately
8x improvement in performance when HAVE_MOVE_PMD is enabled on arm64.
--------- Test Results ----------
The following results were obtained on an arm64 device running a 5.4
kernel, by remapping a PMD-aligned, 1GB sized region to a PMD-aligned
destination. The results from 10 iterations of the test are given below.
All times are in nanoseconds.
Control HAVE_MOVE_PMD
9220833 1247761
9002552 1219896
9254115 1094792
8725885 1227760
9308646 1043698
9001667 1101771
8793385 1159896
8774636 1143594
9553125 1025833
9374010 1078125
9100885.4 1134312.6 <-- Mean Time in nanoseconds
Total mremap time for a 1GB sized PMD-aligned region drops from
~9.1 milliseconds to ~1.1 milliseconds. (~8x speedup).
[1] https://lore.kernel.org/r/20181108181201.88826-3-joelaf@google.com
[2] https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg140837.html
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: https://lore.kernel.org/r/20201014005320.2233162-3-kaleshsingh@google.com
Link: https://lore.kernel.org/kvmarm/20181029102840.GC13965@arm.com/
Signed-off-by: Will Deacon <will@kernel.org>
(cherry picked from commit 45544eee96065cf183fbb937fe1f45a172b06f4e)
Bug: 151772539
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Change-Id: If0d97276cf8de1a5893e97444f2d961db05abea5
Commit e7e832ce6fa7 ("fs: add LSM-supporting anon-inode interface") adds
more kerneldoc description, but also a few new warnings on
anon_inode_getfd_secure() due to missing parameter descriptions.
Rephrase to appropriate kernel-doc for anon_inode_getfd_secure().
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
(cherry picked from commit 365982aba1f264dba26f0908700d62bfa046918c)
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Bug: 160737021
Bug: 169683130
Change-Id: Ie7837f21dfe28c03594ebc65fd293c00c57ba5c5
This change gives userfaultfd file descriptors a real security
context, allowing policy to act on them.
Signed-off-by: Daniel Colascione <dancol@google.com>
[LG: Remove owner inode from userfaultfd_ctx]
[LG: Use anon_inode_getfd_secure() in userfaultfd syscall]
[LG: Use inode of file in userfaultfd_read() in resolve_userfault_fork()]
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
(cherry picked from commit b537900f1598b67bcb8acac20da73c6e26ebbf99)
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Bug: 160737021
Bug: 169683130
Change-Id: Ifd3faca4058bd9e4c51767aa0246e1c53ad410d4
This change uses the anon_inodes and LSM infrastructure introduced in
the previous patches to give SELinux the ability to control
anonymous-inode files that are created using the new
anon_inode_getfd_secure() function.
A SELinux policy author detects and controls these anonymous inodes by
adding a name-based type_transition rule that assigns a new security
type to anonymous-inode files created in some domain. The name used
for the name-based transition is the name associated with the
anonymous inode for file listings --- e.g., "[userfaultfd]" or
"[perf_event]".
Example:
type uffd_t;
type_transition sysadm_t sysadm_t : anon_inode uffd_t "[userfaultfd]";
allow sysadm_t uffd_t:anon_inode { create };
(The next patch in this series is necessary for making userfaultfd
support this new interface. The example above is just
for exposition.)
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
(cherry picked from commit 29cd6591ab6fee3125ea5c1bf350f5013bc615e1)
Conflicts:
security/selinux/include/classmap.h
Compile errors:
security/selinux/hooks.c
(1. Removed 'lockdown' mapping to be in sync with d9cb255af3a03d7b9cdb5ddbab10d9f5c68f97f2)
(2. Replace usage of selinux_initialized() with
selinux_state.initialized)
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Bug: 160737021
Bug: 169683130
Change-Id: I85df2757f121cd7072e91cf3b93c09657bd36b76
This change adds a new function, anon_inode_getfd_secure, that creates
anonymous-node file with individual non-S_PRIVATE inode to which security
modules can apply policy. Existing callers continue using the original
singleton-inode kind of anonymous-inode file. We can transition anonymous
inode users to the new kind of anonymous inode in individual patches for
the sake of bisection and review.
The new function accepts an optional context_inode parameter that callers
can use to provide additional contextual information to security modules.
For example, in case of userfaultfd, the created inode is a 'logical child'
of the context_inode (userfaultfd inode of the parent process) in the sense
that it provides the security context required during creation of the child
process' userfaultfd inode.
Signed-off-by: Daniel Colascione <dancol@google.com>
[LG: Delete obsolete comments to alloc_anon_inode()]
[LG: Add context_inode description in comments to anon_inode_getfd_secure()]
[LG: Remove definition of anon_inode_getfile_secure() as there are no callers]
[LG: Make __anon_inode_getfile() static]
[LG: Use correct error cast in __anon_inode_getfile()]
[LG: Fix error handling in __anon_inode_getfile()]
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
(cherry picked from commit e7e832ce6fa769f800cd7eaebdb0459ad31e0416)
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Bug: 160737021
Bug: 169683130
Change-Id: I88f1821243c58454ce445fd50fd804221e0bfc67
This change adds a new LSM hook, inode_init_security_anon(), that will
be used while creating secure anonymous inodes. The hook allows/denies
its creation and assigns a security context to the inode.
The new hook accepts an optional context_inode parameter that callers
can use to provide additional contextual information to security modules
for granting/denying permission to create an anon-inode of the same type.
This context_inode's security_context can also be used to initialize the
newly created anon-inode's security_context.
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
(cherry picked from commit 215b674b84dd052098fe6389e32a5afaff8b4d56)
Conflicts:
include/linux/lsm_hook_defs.h
(1. Added LSM hook in lsm_hook.h and removd lsm_hook_defs.h as per
98e828a0650f348be85728c69875260cf78069e6, which is not merged here)
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Bug: 160737021
Bug: 169683130
Change-Id: I83fe318c891f034b4dd7f3f357cc74964b55ffc8
With this change, when the knob is set to 0, it allows unprivileged users
to call userfaultfd, like when it is set to 1, but with the restriction
that page faults from only user-mode can be handled. In this mode, an
unprivileged user (without SYS_CAP_PTRACE capability) must pass
UFFD_USER_MODE_ONLY to userfaultd or the API will fail with EPERM.
This enables administrators to reduce the likelihood that an attacker with
access to userfaultfd can delay faulting kernel code to widen timing
windows for other exploits.
The default value of this knob is changed to 0. This is required for
correct functioning of pipe mutex. However, this will fail postcopy live
migration, which will be unnoticeable to the VM guests. To avoid this,
set 'vm.userfault = 1' in /sys/sysctl.conf.
The main reason this change is desirable as in the short term is that the
Android userland will behave as with the sysctl set to zero. So without
this commit, any Linux binary using userfaultfd to manage its memory would
behave differently if run within the Android userland. For more details,
refer to Andrea's reply [1].
[1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/
Link: https://lkml.kernel.org/r/20201120030411.2690816-3-lokeshgidra@google.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: <calin@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d0d4730ac2e404a5b0da9a87ef38c73e51cb1664)
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Bug: 160737021
Bug: 169683130
Change-Id: Ic46c0be47d6394d25bd3443ff524936fa568ab85
Patch series "Control over userfaultfd kernel-fault handling", v6.
This patch series is split from [1]. The other series enables SELinux
support for userfaultfd file descriptors so that its creation and movement
can be controlled.
It has been demonstrated on various occasions that suspending kernel code
execution for an arbitrary amount of time at any access to userspace
memory (copy_from_user()/copy_to_user()/...) can be exploited to change
the intended behavior of the kernel. For instance, handling page faults
in kernel-mode using userfaultfd has been exploited in [2, 3]. Likewise,
FUSE, which is similar to userfaultfd in this respect, has been exploited
in [4, 5] for similar outcome.
This small patch series adds a new flag to userfaultfd(2) that allows
callers to give up the ability to handle kernel-mode faults with the
resulting UFFD file object. It then adds a 'user-mode only' option to the
unprivileged_userfaultfd sysctl knob to require unprivileged callers to
use this new flag.
The purpose of this new interface is to decrease the chance of an
unprivileged userfaultfd user taking advantage of userfaultfd to enhance
security vulnerabilities by lengthening the race window in kernel code.
[1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
[2] https://duasynt.com/blog/linux-kernel-heap-spray
[3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
[4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
[5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
This patch (of 2):
userfaultfd handles page faults from both user and kernel code. Add a new
UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
userfaultfd object refuse to handle faults from kernel mode, treating
these faults as if SIGBUS were always raised, causing the kernel code to
fail with EFAULT.
A future patch adds a knob allowing administrators to give some processes
the ability to create userfaultfd file objects only if they pass
UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
exploit userfaultfd's ability to delay kernel page faults to open timing
windows for future exploits.
Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <calin@google.com>
Cc: Daniel Colascione <dancol@dancol.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Nitin Gupta <nigupta@nvidia.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 37cd0575b8510159992d279c530c05f872990b02)
Conflicts:
include/uapi/linux/userfaultfd.h
(1. Removed uffdio_writeprotect struct part of 63b2d4174c4ad)
Bug: 160737021
Bug: 169683130
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Change-Id: Ic4cecce08956402f91f866ae08f1d4ec21d64289
To be able to update addresses of an IPsec SA, as required by
supporting MOBIKE
Bug: 169169084
Signed-off-by: Yan Yan <evitayan@google.com>
Change-Id: I5aa3f3556d615e4f0695bb78cd3cad9e83851df5
This partially reverts commit 2e77a46f0d.
Reason for revert: ipa3_mdt_load_ipa_fws is not present in this kernel,
this is a bad cherrypick resolution by qcom.
Change-Id: I218978168a6c3fe9f6139a6688be1cb4c0f96b94
"LA.UM.9.14.r1-25800-LAHAINA.QSSI15.0"
* tag 'LA.UM.9.14.r1-25800-LAHAINA.QSSI15.0' of https://git.codelinaro.org/clo/la/platform/vendor/opensource/audio-kernel:
asoc: codec: avoid out of bound write to map array
asoc: codec: avoid out of bound write to map array
asoc: Fixed OOB issue in qcs405
asoc: codec: wcd934x: enable auto recovery when port overflows
Change-Id: I51a30fe905251b6f66e733bae43fdcd3b0a7e787
"LA.UM.9.14.r1-25800-LAHAINA.QSSI15.0"
* tag 'LA.UM.9.14.r1-25800-LAHAINA.QSSI15.0' of https://git.codelinaro.org/clo/la/platform/vendor/qcom-opensource/wlan/qcacld-3.0:
qcacld-3.0: Fix the possible OOB write in country IE unpack
qcacld-3.0: Correcting the TSInfo structure size according to the Spec
Release 2.0.8.34Y
qcacld-3.0: Correcting the TSInfo structure size according to the Spec
Release 2.0.8.34X
qcacld-3.0: Remove use-after-free of frame in tx mgmt send
Release 2.0.8.34W
qcacld-3.0: Fix the possible OOB write in country IE unpack
Release 2.0.8.34V
qcacld-3.0: Enhance the RSNXE inter-op logic
Release 2.0.8.34U
qcacld-3.0: Set sar safety req resp event before unsolited work stop
Release 2.0.8.34T
qcacld-3.0: Update connect request crypto parameters
qcacld-3.0: Enable CFG80211_MULTI_AKM_CONNECT_SUPPORT from kernelv6.0
qcacld-3.0: Update wiphy max_num_akms_connect variable
Change-Id: I79d549c4c8c2f38dc18509f91f12befc15bb19c6
"LA.UM.9.14.r1-25800-LAHAINA.QSSI15.0"
* tag 'LA.UM.9.14.r1-25800-LAHAINA.QSSI15.0' of https://git.codelinaro.org/clo/la/kernel/msm-5.4:
msm: eva: Validating the SFR buffer size before accessing
msm: eva: Copy back the validated size to avoid security issue
msm: npu: Fix use after free issue
USB: dwc3: gadget: Add stop transfer request for isoc transfers
arm64: defconfig: Enable uvc for QCM6490 IOT target
firmware: qcom_scm: do not clear dump mode from shutdown
msm: virtio_npu: Fix use-after-free issue in unmap_buf
msm: virtio_npu: Fix use-after-free issue in virt_npu_map_buf
i2c: i2c-master-msm-geni: add null pointer check in event call back
firmware: qcom_scm: handle echo b > /proc/sysrq-trigger
scripts: mod: replace with a safe function
msm: ep_pcie: Disable hot reset and ignore linkdown
coresight-tmc: Replace deprecated function
USB: dwc3: gadget: Queue data for 16 micro frames ahead in future
power: reset: Disable support of dynamic download mode (ramdump)
Conflicts:
arch/arm64/boot/dts/vendor/bindings/sound/rt5645.txt
Change-Id: I57c063465c2804c77c5a6f62acb6c7987a38bc7f
https://source.android.com/docs/security/bulletin/2025-02-01
CVE-2024-53104
CVE-2025-0088
* tag 'ASB-2025-02-05_11-5.4' of https://android.googlesource.com/kernel/common: (449 commits)
ANDROID: gki - change networking configuration
ANDROID: kernelci build-break for 64-bit riscv clang builds (5.4 only)
Revert "BACKPORT: RISC-V: Stop relying on GCC's register allocator's hueristics"
Revert "ANDROID: declare sp_in_global outside of CONFIG_FRAME_POINTER"
ANDROID: GKI: add Trimble symbol list
UPSTREAM: selinux: ignore unknown extended permissions
ANDROID: ABI: Update allowed list for galaxy
Revert "netfilter: Replace zero-length array with flexible-array member"
Revert "tracing: Constify string literal data member in struct trace_event_call"
Revert "skb_expand_head() adjust skb->truesize incorrectly"
Linux 5.4.289
ftrace: use preempt_enable/disable notrace macros to avoid double fault
mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
drm: adv7511: Drop dsi single lane support
net/sctp: Prevent autoclose integer overflow in sctp_association_init()
sky2: Add device ID 11ab:4373 for Marvell 88E8075
pinctrl: mcp23s08: Fix sleeping in atomic context due to regmap locking
RDMA/uverbs: Prevent integer overflow issue
modpost: fix the missed iteration for the max bit in do_input()
modpost: fix input MODULE_DEVICE_TABLE() built for 64-bit on 32-bit host
...
Conflicts:
arch/arm64/boot/dts/vendor/bindings/clock/adi,axi-clkgen.yaml
arch/arm64/boot/dts/vendor/bindings/clock/axi-clkgen.txt
drivers/rpmsg/qcom_glink_native.c
drivers/soc/qcom/socinfo.c
Change-Id: I60727e0cdd974fda5ca71f938bc2f984a8bbf19a
When STACKPROTECTOR_PER_TASK is enabled, __stack_chk_guard is not
exported which breaks external modules due to missing symbols.
Change-Id: I633dcdd3343f63f6af34aab87e98cc81c6c56fe2
- Proper Handling in case of invalid pinctrl index
- Removing dead code and unused variables
- Change to dereference s_ctrl only after proper
NULL Dereference Check.
CRs-Fixed: 3875406
Change-Id: I8e2c717b22efff2a7d6503d38c048e30eff230da
Signed-off-by: Swami Reddy Reddy <quic_swamired@quicinc.com>
Make the following configuration changes:
CONFIG_NET_CLS_MATCHALL=y
CONFIG_NET_ACT_POLICE=y
CONFIG_NET_ACT_BPF=y
CONFIG_USB_RTL8150=y
CONFIG_USB_NET_CDC_EEM=y
CONFIG_USB_NET_CDC_NCM=y
CONFIG_USB_NET_AQC111=y
Note: 4 of these are already enabled in android12-5.4,
and the remaining 3 will be enabled as well in
https://android-review.googlesource.com/c/kernel/common/+/3470489
Test: TreeHugger
Bug: 377436524
Bug: 391669319
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Change-Id: I1737c1e18a072cac9dd50e4d21b52b220c2ec40c
No 64-bit riscv builds were working with clang (found via kernelci) for our 5.4 kernels:
ld.lld: error: arch/riscv/built-in.a(kernel/signal.o):(function do_notify_resume: .text+0x30c): relocation R_RISCV_PCREL_HI20 out of range: 33554406 is not in [-524288, 524287]; references '__vdso_rt_sigreturn'
>>> referenced by signal.c
>>> defined in arch/riscv/built-in.a(kernel/vdso/vdso-syms.o)
There are two ANDROID-specific patches that added global variables that
must be reverted to fix the build:
f06e9ec979 ("BACKPORT: RISC-V: Stop relying on GCC's register allocator's hueristics")
3818981c3332 ("ANDROID: declare sp_in_global outside of CONFIG_FRAME_POINTER")
Also, CONFIG_INIT_STACK_ALL_PATTERN must be disabled to avoid the
relocation issues.
These issues all seem to be fixed in 5.10 and later kernels.
Bug: 393656515
Signed-off-by: Todd Kjos <tkjos@google.com>
Change-Id: I0a5395a9767b94ec2291c9ef7e9a69f1f4665730
This reverts commit 6dea6a501418e1561d48f141d288866d71684372.
Reason for revert: Prevents device from entering deep sleep (possibly
improper port).
NOTE: dwc3_msm_role_allowed was added in a later change, specifically
"usb: dwc3: dwc3-msm-core: Reject incompatible role/mode request", but
it is necessary to modify it here due to that change's dependency
on dr_mode being present where it is no longer available
after this revert (it is available elsewhere).
Issue: https://gitlab.com/LineageOS/issues/android/-/issues/8226
Issue: calyxos#2970
Change-Id: I03e9b7960d62999e019464b538a2642644e7fc6c
To avoid any OOB write or other security issues, it's good to
validate the buffer size before accessing it.
Change-Id: Ibfdef21293c9385119cfb6338ef36e20c0fc1f2f
Signed-off-by: Pulkit Singh Tak <quic_ptak@quicinc.com>
(cherry picked from commit 8ee6cd6bef52b65e6afba63310e07891e39eb38e)
As we are reading the packet from a shared queue, there is a
possibility to corrupt the packet->size data of shared queue by
malicious FW after validating it in the kernel driver.
Change-Id: I9bff8f2daa64054eada37de54fe3fa837d57b22a
Signed-off-by: Aniruddh Sharma <quic_anirshar@quicinc.com>
Signed-off-by: Madhu Ananthula <quic_mananthu@quicinc.com>
(cherry picked from commit bb8a45801d247ebe5201e1e97d4e890142d6cbf7)
Block USB enumeration from the get-go by blocking role switches away from
USB_ROLE_NONE when usb_data_enabled is false.
Change-Id: I0eff78e56e4a3b64262f220a085cfec5910baf30
Signed-off-by: Sultan Alsawaf <sultan@osomprivacy.com>
If USB controller is configured to work in a specific mode, reject
any compatible role/mode request. For example, if device mode is
only allowed, switching to host mode/role is not allowed. The current
code allows this and it results in accessing invalid dwc host structures.
Change-Id: I5e4d905c8240ad228f48b40fe36298029d8770e1
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
To avoid dependencies for the DWC3 core device to be present during
dwc3_msm_probe(), read out the dr_mode property from the DT node directly.
Since this property can not dynamically change, it will be the same per
compile time setting.
Change-Id: I3b56bde13af141ea01f06ea5b81e44bc034bf7b1
Signed-off-by: Wesley Cheng <wcheng@codeaurora.org>
commit 900f83cf376bdaf798b6f5dcb2eae0c822e908b6 upstream.
When evaluating extended permissions, ignore unknown permissions instead
of calling BUG(). This commit ensures that future permissions can be
added without interfering with older kernels.
Cc: stable@vger.kernel.org
Fixes: fa1aa143ac ("selinux: extended permissions for ioctls")
Signed-off-by: Thiébaud Weksteen <tweek@google.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Thiébaud Weksteen <tweek@google.com>
Change-Id: I1689d8c5084a24c1a34ef3d15d71f8cfa4122447