Skip to content

Commit

Permalink
list_lru: allow explicit memcg and NUMA node selection
Browse files Browse the repository at this point in the history
Patch series "workload-specific and memory pressure-driven zswap
writeback", v8.

There are currently several issues with zswap writeback:

1. There is only a single global LRU for zswap, making it impossible to
   perform worload-specific shrinking - an memcg under memory pressure
   cannot determine which pages in the pool it owns, and often ends up
   writing pages from other memcgs. This issue has been previously
   observed in practice and mitigated by simply disabling
   memcg-initiated shrinking:

   https://lore.kernel.org/all/[email protected]/T/#u

   But this solution leaves a lot to be desired, as we still do not
   have an avenue for an memcg to free up its own memory locked up in
   the zswap pool.

2. We only shrink the zswap pool when the user-defined limit is hit.
   This means that if we set the limit too high, cold data that are
   unlikely to be used again will reside in the pool, wasting precious
   memory. It is hard to predict how much zswap space will be needed
   ahead of time, as this depends on the workload (specifically, on
   factors such as memory access patterns and compressibility of the
   memory pages).

This patch series solves these issues by separating the global zswap LRU
into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e
memcg- and NUMA-aware) zswap writeback under memory pressure.  The new
shrinker does not have any parameter that must be tuned by the user, and
can be opted in or out on a per-memcg basis.

As a proof of concept, we ran the following synthetic benchmark: build the
linux kernel in a memory-limited cgroup, and allocate some cold data in
tmpfs to see if the shrinker could write them out and improved the overall
performance.  Depending on the amount of cold data generated, we observe
from 14% to 35% reduction in kernel CPU time used in the kernel builds.


This patch (of 6):

The interface of list_lru is based on the assumption that the list node
and the data it represents belong to the same allocated on the correct
node/memcg.  While this assumption is valid for existing slab objects LRU
such as dentries and inodes, it is undocumented, and rather inflexible for
certain potential list_lru users (such as the upcoming zswap shrinker and
the THP shrinker).  It has caused us a lot of issues during our
development.

This patch changes list_lru interface so that the caller must explicitly
specify numa node and memcg when adding and removing objects.  The old
list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and
list_lru_del_obj(), respectively.

It also extends the list_lru API with a new function, list_lru_putback,
which undoes a previous list_lru_isolate call.  Unlike list_lru_add, it
does not increment the LRU node count (as list_lru_isolate does not
decrement the node count).  list_lru_putback also allows for explicit
memcg and NUMA node selection.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nhat Pham <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Tested-by: Bagas Sanjaya <[email protected]>
Cc: Chris Li <[email protected]>
Cc: Dan Streetman <[email protected]>
Cc: Domenico Cerasuolo <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Seth Jennings <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: Vitaly Wool <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
  • Loading branch information
nhatsmrt authored and akpm00 committed Dec 12, 2023
1 parent 330018f commit 0a97c01
Show file tree
Hide file tree
Showing 12 changed files with 117 additions and 36 deletions.
7 changes: 3 additions & 4 deletions drivers/android/binder_alloc.c
Original file line number Diff line number Diff line change
Expand Up @@ -234,7 +234,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate,
if (page->page_ptr) {
trace_binder_alloc_lru_start(alloc, index);

on_lru = list_lru_del(&binder_alloc_lru, &page->lru);
on_lru = list_lru_del_obj(&binder_alloc_lru, &page->lru);
WARN_ON(!on_lru);

trace_binder_alloc_lru_end(alloc, index);
Expand Down Expand Up @@ -285,7 +285,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate,

trace_binder_free_lru_start(alloc, index);

ret = list_lru_add(&binder_alloc_lru, &page->lru);
ret = list_lru_add_obj(&binder_alloc_lru, &page->lru);
WARN_ON(!ret);

trace_binder_free_lru_end(alloc, index);
Expand Down Expand Up @@ -848,7 +848,7 @@ void binder_alloc_deferred_release(struct binder_alloc *alloc)
if (!alloc->pages[i].page_ptr)
continue;

on_lru = list_lru_del(&binder_alloc_lru,
on_lru = list_lru_del_obj(&binder_alloc_lru,
&alloc->pages[i].lru);
page_addr = alloc->buffer + i * PAGE_SIZE;
binder_alloc_debug(BINDER_DEBUG_BUFFER_ALLOC,
Expand Down Expand Up @@ -1287,4 +1287,3 @@ int binder_alloc_copy_from_buffer(struct binder_alloc *alloc,
return binder_alloc_do_buffer_copy(alloc, false, buffer, buffer_offset,
dest, bytes);
}

8 changes: 5 additions & 3 deletions fs/dcache.c
Original file line number Diff line number Diff line change
Expand Up @@ -428,7 +428,8 @@ static void d_lru_add(struct dentry *dentry)
this_cpu_inc(nr_dentry_unused);
if (d_is_negative(dentry))
this_cpu_inc(nr_dentry_negative);
WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
WARN_ON_ONCE(!list_lru_add_obj(
&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
}

static void d_lru_del(struct dentry *dentry)
Expand All @@ -438,7 +439,8 @@ static void d_lru_del(struct dentry *dentry)
this_cpu_dec(nr_dentry_unused);
if (d_is_negative(dentry))
this_cpu_dec(nr_dentry_negative);
WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
WARN_ON_ONCE(!list_lru_del_obj(
&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
}

static void d_shrink_del(struct dentry *dentry)
Expand Down Expand Up @@ -1240,7 +1242,7 @@ static enum lru_status dentry_lru_isolate(struct list_head *item,
*
* This is guaranteed by the fact that all LRU management
* functions are intermediated by the LRU API calls like
* list_lru_add and list_lru_del. List movement in this file
* list_lru_add_obj and list_lru_del_obj. List movement in this file
* only ever occur through this functions or through callbacks
* like this one, that are called from the LRU API.
*
Expand Down
6 changes: 3 additions & 3 deletions fs/gfs2/quota.c
Original file line number Diff line number Diff line change
Expand Up @@ -271,7 +271,7 @@ static struct gfs2_quota_data *gfs2_qd_search_bucket(unsigned int hash,
if (qd->qd_sbd != sdp)
continue;
if (lockref_get_not_dead(&qd->qd_lockref)) {
list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
list_lru_del_obj(&gfs2_qd_lru, &qd->qd_lru);
return qd;
}
}
Expand Down Expand Up @@ -344,7 +344,7 @@ static void qd_put(struct gfs2_quota_data *qd)
}

qd->qd_lockref.count = 0;
list_lru_add(&gfs2_qd_lru, &qd->qd_lru);
list_lru_add_obj(&gfs2_qd_lru, &qd->qd_lru);
spin_unlock(&qd->qd_lockref.lock);
}

Expand Down Expand Up @@ -1517,7 +1517,7 @@ void gfs2_quota_cleanup(struct gfs2_sbd *sdp)
lockref_mark_dead(&qd->qd_lockref);
spin_unlock(&qd->qd_lockref.lock);

list_lru_del(&gfs2_qd_lru, &qd->qd_lru);
list_lru_del_obj(&gfs2_qd_lru, &qd->qd_lru);
list_add(&qd->qd_lru, &dispose);
}
spin_unlock(&qd_lock);
Expand Down
4 changes: 2 additions & 2 deletions fs/inode.c
Original file line number Diff line number Diff line change
Expand Up @@ -464,7 +464,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate)
if (!mapping_shrinkable(&inode->i_data))
return;

if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
this_cpu_inc(nr_unused);
else if (rotate)
inode->i_state |= I_REFERENCED;
Expand All @@ -482,7 +482,7 @@ void inode_add_lru(struct inode *inode)

static void inode_lru_list_del(struct inode *inode)
{
if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru))
this_cpu_dec(nr_unused);
}

Expand Down
8 changes: 4 additions & 4 deletions fs/nfs/nfs42xattr.c
Original file line number Diff line number Diff line change
Expand Up @@ -132,7 +132,7 @@ nfs4_xattr_entry_lru_add(struct nfs4_xattr_entry *entry)
lru = (entry->flags & NFS4_XATTR_ENTRY_EXTVAL) ?
&nfs4_xattr_large_entry_lru : &nfs4_xattr_entry_lru;

return list_lru_add(lru, &entry->lru);
return list_lru_add_obj(lru, &entry->lru);
}

static bool
Expand All @@ -143,7 +143,7 @@ nfs4_xattr_entry_lru_del(struct nfs4_xattr_entry *entry)
lru = (entry->flags & NFS4_XATTR_ENTRY_EXTVAL) ?
&nfs4_xattr_large_entry_lru : &nfs4_xattr_entry_lru;

return list_lru_del(lru, &entry->lru);
return list_lru_del_obj(lru, &entry->lru);
}

/*
Expand Down Expand Up @@ -349,7 +349,7 @@ nfs4_xattr_cache_unlink(struct inode *inode)

oldcache = nfsi->xattr_cache;
if (oldcache != NULL) {
list_lru_del(&nfs4_xattr_cache_lru, &oldcache->lru);
list_lru_del_obj(&nfs4_xattr_cache_lru, &oldcache->lru);
oldcache->inode = NULL;
}
nfsi->xattr_cache = NULL;
Expand Down Expand Up @@ -474,7 +474,7 @@ nfs4_xattr_get_cache(struct inode *inode, int add)
kref_get(&cache->ref);
nfsi->xattr_cache = cache;
cache->inode = inode;
list_lru_add(&nfs4_xattr_cache_lru, &cache->lru);
list_lru_add_obj(&nfs4_xattr_cache_lru, &cache->lru);
}

spin_unlock(&inode->i_lock);
Expand Down
4 changes: 2 additions & 2 deletions fs/nfsd/filecache.c
Original file line number Diff line number Diff line change
Expand Up @@ -322,7 +322,7 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
static bool nfsd_file_lru_add(struct nfsd_file *nf)
{
set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
if (list_lru_add(&nfsd_file_lru, &nf->nf_lru)) {
if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
trace_nfsd_file_lru_add(nf);
return true;
}
Expand All @@ -331,7 +331,7 @@ static bool nfsd_file_lru_add(struct nfsd_file *nf)

static bool nfsd_file_lru_remove(struct nfsd_file *nf)
{
if (list_lru_del(&nfsd_file_lru, &nf->nf_lru)) {
if (list_lru_del_obj(&nfsd_file_lru, &nf->nf_lru)) {
trace_nfsd_file_lru_del(nf);
return true;
}
Expand Down
6 changes: 3 additions & 3 deletions fs/xfs/xfs_buf.c
Original file line number Diff line number Diff line change
Expand Up @@ -169,7 +169,7 @@ xfs_buf_stale(

atomic_set(&bp->b_lru_ref, 0);
if (!(bp->b_state & XFS_BSTATE_DISPOSE) &&
(list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
(list_lru_del_obj(&bp->b_target->bt_lru, &bp->b_lru)))
atomic_dec(&bp->b_hold);

ASSERT(atomic_read(&bp->b_hold) >= 1);
Expand Down Expand Up @@ -1047,7 +1047,7 @@ xfs_buf_rele(
* buffer for the LRU and clear the (now stale) dispose list
* state flag
*/
if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
if (list_lru_add_obj(&bp->b_target->bt_lru, &bp->b_lru)) {
bp->b_state &= ~XFS_BSTATE_DISPOSE;
atomic_inc(&bp->b_hold);
}
Expand All @@ -1060,7 +1060,7 @@ xfs_buf_rele(
* was on was the disposal list
*/
if (!(bp->b_state & XFS_BSTATE_DISPOSE)) {
list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
list_lru_del_obj(&bp->b_target->bt_lru, &bp->b_lru);
} else {
ASSERT(list_empty(&bp->b_lru));
}
Expand Down
2 changes: 1 addition & 1 deletion fs/xfs/xfs_dquot.c
Original file line number Diff line number Diff line change
Expand Up @@ -1065,7 +1065,7 @@ xfs_qm_dqput(
struct xfs_quotainfo *qi = dqp->q_mount->m_quotainfo;
trace_xfs_dqput_free(dqp);

if (list_lru_add(&qi->qi_lru, &dqp->q_lru))
if (list_lru_add_obj(&qi->qi_lru, &dqp->q_lru))
XFS_STATS_INC(dqp->q_mount, xs_qm_dquot_unused);
}
xfs_dqunlock(dqp);
Expand Down
2 changes: 1 addition & 1 deletion fs/xfs/xfs_qm.c
Original file line number Diff line number Diff line change
Expand Up @@ -171,7 +171,7 @@ xfs_qm_dqpurge(
* hits zero, so it really should be on the freelist here.
*/
ASSERT(!list_empty(&dqp->q_lru));
list_lru_del(&qi->qi_lru, &dqp->q_lru);
list_lru_del_obj(&qi->qi_lru, &dqp->q_lru);
XFS_STATS_DEC(dqp->q_mount, xs_qm_dquot_unused);

xfs_qm_dqdestroy(dqp);
Expand Down
54 changes: 51 additions & 3 deletions include/linux/list_lru.h
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,8 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
* list_lru_add: add an element to the lru list's tail
* @lru: the lru pointer
* @item: the item to be added.
* @nid: the node id of the sublist to add the item to.
* @memcg: the cgroup of the sublist to add the item to.
*
* If the element is already part of a list, this function returns doing
* nothing. Therefore the caller does not need to keep state about whether or
Expand All @@ -87,20 +89,50 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
*
* Return: true if the list was updated, false otherwise
*/
bool list_lru_add(struct list_lru *lru, struct list_head *item);
bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
struct mem_cgroup *memcg);

/**
* list_lru_del: delete an element to the lru list
* list_lru_add_obj: add an element to the lru list's tail
* @lru: the lru pointer
* @item: the item to be added.
*
* This function is similar to list_lru_add(), but the NUMA node and the
* memcg of the sublist is determined by @item list_head. This assumption is
* valid for slab objects LRU such as dentries, inodes, etc.
*
* Return value: true if the list was updated, false otherwise
*/
bool list_lru_add_obj(struct list_lru *lru, struct list_head *item);

/**
* list_lru_del: delete an element from the lru list
* @lru: the lru pointer
* @item: the item to be deleted.
* @nid: the node id of the sublist to delete the item from.
* @memcg: the cgroup of the sublist to delete the item from.
*
* This function works analogously as list_lru_add() in terms of list
* manipulation. The comments about an element already pertaining to
* a list are also valid for list_lru_del().
*
* Return: true if the list was updated, false otherwise
*/
bool list_lru_del(struct list_lru *lru, struct list_head *item);
bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
struct mem_cgroup *memcg);

/**
* list_lru_del_obj: delete an element from the lru list
* @lru: the lru pointer
* @item: the item to be deleted.
*
* This function is similar to list_lru_del(), but the NUMA node and the
* memcg of the sublist is determined by @item list_head. This assumption is
* valid for slab objects LRU such as dentries, inodes, etc.
*
* Return value: true if the list was updated, false otherwise.
*/
bool list_lru_del_obj(struct list_lru *lru, struct list_head *item);

/**
* list_lru_count_one: return the number of objects currently held by @lru
Expand Down Expand Up @@ -138,6 +170,22 @@ static inline unsigned long list_lru_count(struct list_lru *lru)
void list_lru_isolate(struct list_lru_one *list, struct list_head *item);
void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item,
struct list_head *head);
/**
* list_lru_putback: undo list_lru_isolate
* @lru: the lru pointer.
* @item: the item to put back.
* @nid: the node id of the sublist to put the item back to.
* @memcg: the cgroup of the sublist to put the item back to.
*
* Put back an isolated item into its original LRU. Note that unlike
* list_lru_add, this does not increment the node LRU count (as
* list_lru_isolate does not originally decrement this count).
*
* Since we might have dropped the LRU lock in between, recompute list_lru_one
* from the node's id and memcg.
*/
void list_lru_putback(struct list_lru *lru, struct list_head *item, int nid,
struct mem_cgroup *memcg);

typedef enum lru_status (*list_lru_walk_cb)(struct list_head *item,
struct list_lru_one *list, spinlock_t *lock, void *cb_arg);
Expand Down
48 changes: 40 additions & 8 deletions mm/list_lru.c
Original file line number Diff line number Diff line change
Expand Up @@ -116,21 +116,19 @@ list_lru_from_kmem(struct list_lru *lru, int nid, void *ptr,
}
#endif /* CONFIG_MEMCG_KMEM */

bool list_lru_add(struct list_lru *lru, struct list_head *item)
bool list_lru_add(struct list_lru *lru, struct list_head *item, int nid,
struct mem_cgroup *memcg)
{
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
struct mem_cgroup *memcg;
struct list_lru_one *l;

spin_lock(&nlru->lock);
if (list_empty(item)) {
l = list_lru_from_kmem(lru, nid, item, &memcg);
l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
list_add_tail(item, &l->list);
/* Set shrinker bit if the first element was added */
if (!l->nr_items++)
set_shrinker_bit(memcg, nid,
lru_shrinker_id(lru));
set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
nlru->nr_items++;
spin_unlock(&nlru->lock);
return true;
Expand All @@ -140,15 +138,25 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
}
EXPORT_SYMBOL_GPL(list_lru_add);

bool list_lru_del(struct list_lru *lru, struct list_head *item)
bool list_lru_add_obj(struct list_lru *lru, struct list_head *item)
{
int nid = page_to_nid(virt_to_page(item));
struct mem_cgroup *memcg = list_lru_memcg_aware(lru) ?
mem_cgroup_from_slab_obj(item) : NULL;

return list_lru_add(lru, item, nid, memcg);
}
EXPORT_SYMBOL_GPL(list_lru_add_obj);

bool list_lru_del(struct list_lru *lru, struct list_head *item, int nid,
struct mem_cgroup *memcg)
{
struct list_lru_node *nlru = &lru->node[nid];
struct list_lru_one *l;

spin_lock(&nlru->lock);
if (!list_empty(item)) {
l = list_lru_from_kmem(lru, nid, item, NULL);
l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
list_del_init(item);
l->nr_items--;
nlru->nr_items--;
Expand All @@ -160,6 +168,16 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
}
EXPORT_SYMBOL_GPL(list_lru_del);

bool list_lru_del_obj(struct list_lru *lru, struct list_head *item)
{
int nid = page_to_nid(virt_to_page(item));
struct mem_cgroup *memcg = list_lru_memcg_aware(lru) ?
mem_cgroup_from_slab_obj(item) : NULL;

return list_lru_del(lru, item, nid, memcg);
}
EXPORT_SYMBOL_GPL(list_lru_del_obj);

void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
{
list_del_init(item);
Expand All @@ -175,6 +193,20 @@ void list_lru_isolate_move(struct list_lru_one *list, struct list_head *item,
}
EXPORT_SYMBOL_GPL(list_lru_isolate_move);

void list_lru_putback(struct list_lru *lru, struct list_head *item, int nid,
struct mem_cgroup *memcg)
{
struct list_lru_one *list =
list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));

if (list_empty(item)) {
list_add_tail(item, &list->list);
if (!list->nr_items++)
set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
}
}
EXPORT_SYMBOL_GPL(list_lru_putback);

unsigned long list_lru_count_one(struct list_lru *lru,
int nid, struct mem_cgroup *memcg)
{
Expand Down
4 changes: 2 additions & 2 deletions mm/workingset.c
Original file line number Diff line number Diff line change
Expand Up @@ -615,12 +615,12 @@ void workingset_update_node(struct xa_node *node)

if (node->count && node->count == node->nr_values) {
if (list_empty(&node->private_list)) {
list_lru_add(&shadow_nodes, &node->private_list);
list_lru_add_obj(&shadow_nodes, &node->private_list);
__inc_lruvec_kmem_state(node, WORKINGSET_NODES);
}
} else {
if (!list_empty(&node->private_list)) {
list_lru_del(&shadow_nodes, &node->private_list);
list_lru_del_obj(&shadow_nodes, &node->private_list);
__dec_lruvec_kmem_state(node, WORKINGSET_NODES);
}
}
Expand Down

0 comments on commit 0a97c01

Please sign in to comment.