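The Buddy Allocator in One Article

The buddy allocator

1. How the buddy allocator works

2. Pros and cons of the buddy allocator

3. Allocation and release flow of the buddy allocator

4. Buddy allocator data structures

5. Fallback zone lists

6. Structure of the buddy allocator

7. Zone watermarks

8. Walking through the buddy allocator's allocation path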

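Linux has three main memory allocators: the boot memory allocator, the buddy allocator, and the slab allocator.

The buddy allocator

Once the kernel has finished initializing, it manages physical pages with a page allocator, and the page allocator Linux uses is the buddy allocator. Its algorithm is simple and efficient, and it is aware of memory nodes and zones. To limit fragmentation it groups physical memory by mobility; it has a dedicated optimization for single-page allocations; and to reduce lock contention between processors, each memory zone carries a per-CPU page set.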

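1. How the buddy allocator works

A run of contiguous physical pages is called a page block. Order is the buddy allocator's term for its unit of page count: 2^n contiguous pages form an order-n page block. Free memory is tracked in 11 orders, 0 through 10, and a block of a given order holds 2^order contiguous pages. If no free block of the requested order is available, a block of a higher order is split in half, and the two halves become buddies of each other. One half is used for the allocation and the other stays free. Blocks are halved repeatedly as needed until a block of the required size is produced. When a block is freed, the allocator checks whether its buddy is also free; if so, the pair is merged into one block of the next higher order.
Two order-n page blocks are buddies when all of the following hold:
1) The two blocks are adjacent, i.e. their physical addresses are contiguous;
2) The page frame number of the first page of each block is a multiple of 2^n;
3) If they are merged into an order-(n+1) block, the page frame number of its first page is a multiple of 2^(n+1).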
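
The three conditions collapse into one bit operation: flipping bit n of a block's first page frame number gives its buddy. The following is a minimal sketch in plain user-space C; the helper names buddy_pfn_of and merged_pfn_of are made up for illustration and are not kernel functions.

```c
#include <assert.h>

/* Hypothetical helper: PFN of the buddy of the order-`order` block at `pfn`. */
static unsigned long buddy_pfn_of(unsigned long pfn, unsigned int order)
{
    /* Condition 2: the block itself must start on a 2^order boundary. */
    assert((pfn & ((1UL << order) - 1)) == 0);
    /* Flipping bit `order` selects the adjacent block of the same size. */
    return pfn ^ (1UL << order);
}

/* Hypothetical helper: first PFN of the merged order-(order+1) block. */
static unsigned long merged_pfn_of(unsigned long pfn, unsigned int order)
{
    /* Clearing bit `order` yields the lower of the two buddies (condition 3). */
    return pfn & ~(1UL << order);
}
```

For example, the order-2 blocks at PFN 8 and PFN 12 are buddies and merge into the order-3 block at PFN 8. The order-2 blocks at PFN 4 and PFN 8 are adjacent but not buddies: a merged block would start at PFN 4, which is not a multiple of 8.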

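2. Pros and cons of the buddy allocator

Pros: free blocks are kept on separate lists per order. From the requested size the allocator works out which order to search; if that order has no free block, it moves on to the next higher order. Allocation is therefore much faster than the boot memory allocator's linear bitmap scan.
Cons:
1) When freeing pages, the caller has to remember the order the block was allocated with and free the 2^order pages starting at that page, which is somewhat inconvenient for callers.
2) The buddy allocator always hands out 2^order pages at a time, so whenever the memory actually needed is smaller than 2^order pages, the remainder is wasted. To deal with this internal fragmentation, Linux adds the slab allocator on top.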
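
To put a number on the second drawback, the sketch below rounds a page count up to the next power of two and reports how many pages go unused. The helper names order_for_pages and wasted_pages are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical helper: smallest order whose block holds at least `pages` pages. */
static unsigned int order_for_pages(unsigned long pages)
{
    unsigned int order = 0;

    assert(pages > 0);
    while ((1UL << order) < pages)
        order++;
    return order;
}

/* Pages handed out but not needed once `pages` is rounded up to 2^order. */
static unsigned long wasted_pages(unsigned long pages)
{
    return (1UL << order_for_pages(pages)) - pages;
}
```

A request for 129 pages is served with an order-8 block (256 pages), so 127 pages are wasted; a request for exactly 128 pages wastes nothing.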

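3. Allocation and release flow of the buddy allocator

The buddy allocator allocates and frees physical pages in units of orders. Allocating an order-n page block goes as follows:
1) Check whether a free order-n block exists. If so, allocate it directly; otherwise go to the next step.
2) Check whether a free order-(n+1) block exists. If so, split it into two order-n blocks, put one on the free order-n list, and allocate the other; otherwise go to the next step.
3) Check whether a free order-(n+2) block exists. If so, split it into two order-(n+1) blocks and put one on the free order-(n+1) list; split the other into two order-n blocks, put one on the free order-n list, and allocate the other. If not, keep looking for a free block at higher orders.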
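
The steps above can be sketched with a toy model that tracks only how many free blocks each order has. The type toy_zone and the function toy_alloc are invented for this illustration; the kernel keeps real linked lists of pages rather than counters.

```c
#include <assert.h>

#define TOY_MAX_ORDER 11  /* orders 0..10, as in the text */

/* Toy model: only the number of free blocks per order is tracked. */
struct toy_zone {
    unsigned long nr_free[TOY_MAX_ORDER];
};

/*
 * Allocate one order-`order` block. Returns the order of the block that was
 * taken from the free lists, or -1 if no block of sufficient order is free.
 */
static int toy_alloc(struct toy_zone *z, unsigned int order)
{
    unsigned int cur;

    assert(order < TOY_MAX_ORDER);
    for (cur = order; cur < TOY_MAX_ORDER; cur++) {
        if (z->nr_free[cur] == 0)
            continue;           /* nothing here, look one order higher */
        z->nr_free[cur]--;      /* take the block off its list */
        /* Split down: every halving leaves one free buddy behind. */
        for (unsigned int o = cur; o > order; o--)
            z->nr_free[o - 1]++;
        return (int)cur;
    }
    return -1;
}
```

Starting with a single free order-3 block, an order-1 request consumes it and leaves one free order-2 block and one free order-1 block behind, which is step 3 of the list with n = 1.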

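4. Buddy allocator data structures

The zoned buddy allocator works on one zone of one memory node at a time. The zone structure's free_area member tracks the free page blocks; the array index is the order of the blocks.

Kernel source: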

struct free_area {
 struct list_head free_list[MIGRATE_TYPES];
 unsigned long  nr_free;
};

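The kernel defines GFP_ZONE_TABLE, a table that maps each combination of zone modifier flags to a zone type. GFP_ZONES_SHIFT is the number of bits one zone type occupies. GFP_ZONE_TABLE maps every flag combination to a position inside a 32-bit integer: the offset is (flag combination * bits per zone type), and the GFP_ZONES_SHIFT bits starting at that offset hold the zone type.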

#define GFP_ZONE_TABLE ( \
 (ZONE_NORMAL << 0 * GFP_ZONES_SHIFT) \
 | (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT) \
 | (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
 | (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
 | (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
 | (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
 | (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT) \
 | (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT) \
)
// zone modifier bits of the GFP flags, used to look up the preferred zone
#define ___GFP_DMA  0x01u
#define ___GFP_HIGHMEM  0x02u
#define ___GFP_DMA32  0x04u
#define ___GFP_MOVABLE  0x08u
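
To make the table lookup concrete, here is a user-space sketch. It assumes a configuration with four zone types and no HIGHMEM, so that two bits per entry suffice and OPT_ZONE_HIGHMEM falls back to ZONE_NORMAL; the enum values and the function toy_gfp_zone are stand-ins chosen for this example, not taken from a real kernel build.

```c
#include <assert.h>

/* Assumed configuration: four zone types, 2 bits per table entry. */
enum toy_zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE };
#define GFP_ZONES_SHIFT   2
#define OPT_ZONE_DMA      ZONE_DMA
#define OPT_ZONE_DMA32    ZONE_DMA32
#define OPT_ZONE_HIGHMEM  ZONE_NORMAL   /* assumption: no HIGHMEM zone */

#define ___GFP_DMA      0x01u
#define ___GFP_HIGHMEM  0x02u
#define ___GFP_DMA32    0x04u
#define ___GFP_MOVABLE  0x08u

#define GFP_ZONE_TABLE ( \
      ((unsigned long)ZONE_NORMAL      << 0 * GFP_ZONES_SHIFT) \
    | ((unsigned long)OPT_ZONE_DMA     << ___GFP_DMA * GFP_ZONES_SHIFT) \
    | ((unsigned long)OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT) \
    | ((unsigned long)OPT_ZONE_DMA32   << ___GFP_DMA32 * GFP_ZONES_SHIFT) \
    | ((unsigned long)ZONE_NORMAL      << ___GFP_MOVABLE * GFP_ZONES_SHIFT) \
    | ((unsigned long)OPT_ZONE_DMA     << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT) \
    | ((unsigned long)ZONE_MOVABLE     << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT) \
    | ((unsigned long)OPT_ZONE_DMA32   << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT) \
)

/* Sketch of the lookup: keep the zone bits, then extract that table entry. */
static enum toy_zone_type toy_gfp_zone(unsigned int flags)
{
    unsigned int bits = flags & (___GFP_DMA | ___GFP_HIGHMEM |
                                 ___GFP_DMA32 | ___GFP_MOVABLE);

    assert(bits * GFP_ZONES_SHIFT < 32);
    return (enum toy_zone_type)((GFP_ZONE_TABLE >> (bits * GFP_ZONES_SHIFT)) &
                                ((1UL << GFP_ZONES_SHIFT) - 1));
}
```

Under these assumptions, no zone flag at all selects ZONE_NORMAL, ___GFP_DMA32 selects ZONE_DMA32, and ___GFP_MOVABLE | ___GFP_HIGHMEM selects ZONE_MOVABLE.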

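5. Fallback zone lists

Fallback zones matter a lot. I do not fully understand them yet; what I do know is that they speed up memory allocation, and the fast path below relies on them.
If the preferred memory node or zone cannot satisfy an allocation request, physical pages can be borrowed from a fallback zone. Borrowing has to follow certain rules.
Borrowing rules:
1) A zone type on one memory node may borrow physical pages from the same zone type on another node. For example, the normal zone of node 0 may borrow from the normal zone of node 1.
2) A higher zone type may borrow physical pages from a lower zone type. For example, the normal zone may borrow from the DMA zone.
3) A lower zone type may not borrow physical pages from a higher zone type. For example, the DMA zone may not borrow from the normal zone.
Each memory node's pg_data_t instance holds its fallback zone lists in the node_zonelists member.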

6. Structure of the buddy allocator

Kernel source:

typedef struct pglist_data {
 struct zone node_zones[MAX_NR_ZONES]; // array of this node's memory zones
 struct zonelist node_zonelists[MAX_ZONELISTS]; // MAX_ZONELISTS fallback zone lists

 int nr_zones; // number of memory zones in this node
......
} pg_data_t;
// struct zone is covered in "Linux Memory Management (Part 1)"
struct zonelist {
 struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
};
struct zoneref {
 struct zone *zone; // points to the memory zone structure
 int zone_idx; // type of the zone that the zone member points to
};
enum {
 ZONELIST_FALLBACK, // fallback zone list covering all memory nodes
#ifdef CONFIG_NUMA
 /*
  * The NUMA zonelists are doubled because we need zonelists that
  * restrict the allocations to a single node for __GFP_THISNODE.
  */
 ZONELIST_NOFALLBACK, // fallback zone list covering only the current node (NUMA only)
#endif
 MAX_ZONELISTS // number of fallback zone lists
};

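A UMA system has a single fallback zone list, ordered by zone type from high to low. If a UMA system has a normal zone and a DMA zone, the fallback list is (normal zone, DMA zone). On a NUMA system every memory node has two fallback zone lists: one covers the zones of all nodes, the other covers only the zones of the current node.

The ZONELIST_FALLBACK list (the one covering all memory nodes) can be ordered in two ways:
a. Node order
Sort by node distance from near to far first, then within each node by zone type from high to low.
The advantage is that nearby memory is preferred; the drawback is that low zones get used before the high zones are exhausted.
b. Zone order
Sort by zone type from high to low first, then within each zone type by node distance from near to far.
The advantage is that low zones are less likely to run out; the drawback is that nearby memory is not guaranteed to be preferred.
The default is to pick the better ordering automatically. On a 64-bit system, relatively few allocations need the DMA and DMA32 zones, so node order is chosen; on a 32-bit system, zone order is chosen.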
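
The two orderings differ only in which loop is the outer one. The sketch below builds both for a small system; struct toy_zoneref and the two build functions are hypothetical, and the node array is assumed to be sorted by distance already.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of a fallback zone list. */
struct toy_zoneref {
    int node;       /* node id */
    int zone_type;  /* larger value = higher zone type */
};

/* Node order: nearest node first; within a node, zone types high to low. */
static size_t build_node_order(const int *nodes_by_distance, size_t nr_nodes,
                               int nr_zone_types, struct toy_zoneref *out)
{
    size_t n = 0;

    assert(nr_zone_types > 0);
    for (size_t i = 0; i < nr_nodes; i++) {
        for (int zt = nr_zone_types - 1; zt >= 0; zt--) {
            out[n].node = nodes_by_distance[i];
            out[n].zone_type = zt;
            n++;
        }
    }
    return n;
}

/* Zone order: highest zone type first; within a type, nearest node first. */
static size_t build_zone_order(const int *nodes_by_distance, size_t nr_nodes,
                               int nr_zone_types, struct toy_zoneref *out)
{
    size_t n = 0;

    assert(nr_zone_types > 0);
    for (int zt = nr_zone_types - 1; zt >= 0; zt--) {
        for (size_t i = 0; i < nr_nodes; i++) {
            out[n].node = nodes_by_distance[i];
            out[n].zone_type = zt;
            n++;
        }
    }
    return n;
}
```

With two nodes and two zone types (0 = DMA, 1 = normal), node order gives (node 0 normal, node 0 DMA, node 1 normal, node 1 DMA), while zone order gives (node 0 normal, node 1 normal, node 0 DMA, node 1 DMA).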

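7. Zone watermarks

When does the preferred zone borrow physical pages from a fallback zone? Every memory zone has three watermarks:
a. High watermark (high): if the zone's free page count is above the high watermark, the zone has plenty of memory;
b. Low watermark (low): if the zone's free page count is below the low watermark, the zone is slightly short of memory;
c. Min watermark (min): if the zone's free page count is below the min watermark, the zone is severely short of memory.
Each zone's watermarks are computed at initialization from the number of physical pages in that zone. The results are stored in the watermark array of struct zone and are read through the following macros: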

#define min_wmark_pages(z) (z->watermark[WMARK_MIN])
#define low_wmark_pages(z) (z->watermark[WMARK_LOW])
#define high_wmark_pages(z) (z->watermark[WMARK_HIGH])

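The page counters in struct zone are related as follows:

spanned_pages = zone_end_pfn - zone_start_pfn; // end PFN minus start PFN = total pages the zone spans (holes included)
present_pages = spanned_pages - absent_pages(pages in holes); // spanned pages minus hole pages = physical pages actually present in the zone
managed_pages = present_pages - reserved_pages; // present pages minus reserved pages = pages managed by the buddy allocator

Memory below the min watermark is called the emergency reserve. It is normally used for memory reclaim and is off limits otherwise. In an emergency, when memory is severely short, it is handed to processes that in effect promise: "give me a little of the emergency reserve and I can free more memory."

The watermarks and page counts of each zone can be seen in /proc/zoneinfo: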

jian@ubuntu:~/share/linux-4.19.40-note$ cat /proc/zoneinfo 
Node 0, zone      DMA
  pages free     3912
        min      7
        low      8
        high     10
        scanned  0
        spanned  4095
        present  3997
        managed  3976
...
Node 0, zone    DMA32
  pages free     6515
        min      1497
        low      1871
        high     2245
        scanned  0
        spanned  1044480
        present  782288
        managed  762172
  ...
Node 0, zone   Normal
  pages free     2964
        min      474
        low      592
        high     711
        scanned  0
        spanned  262144
        present  262144
        managed  241089
  ...
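
The watermarks in this listing are consistent with low = min + min/4 and high = min + min/2 using integer division. The sketch below reproduces that spacing and classifies a free page count the way the three watermark rules describe. It is only a reading of this particular output: the struct and function names are hypothetical, and other kernel versions may space the watermarks differently.

```c
#include <assert.h>

/* Hypothetical container for the three watermarks of one zone. */
struct toy_wmarks {
    unsigned long min, low, high;
};

enum toy_pressure { SEVERE_SHORTAGE, MILD_SHORTAGE, IN_BETWEEN, PLENTIFUL };

/* Derive low/high from min using the spacing seen in the listing above. */
static struct toy_wmarks wmarks_from_min(unsigned long min)
{
    struct toy_wmarks w;

    w.min  = min;
    w.low  = min + (min >> 2);  /* min plus a quarter of min */
    w.high = min + (min >> 1);  /* min plus half of min */
    return w;
}

/* Classify a free-page count against the three watermarks. */
static enum toy_pressure zone_pressure(unsigned long free_pages,
                                       struct toy_wmarks w)
{
    assert(w.min <= w.low && w.low <= w.high);
    if (free_pages < w.min)
        return SEVERE_SHORTAGE;
    if (free_pages < w.low)
        return MILD_SHORTAGE;
    if (free_pages > w.high)
        return PLENTIFUL;
    return IN_BETWEEN;
}
```

Feeding in the min values from the listing (7, 1497, and 474) yields exactly the low and high values shown for the DMA, DMA32, and Normal zones, and all three zones have more free pages than their high watermark.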

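8. Walking through the buddy allocator's allocation path

When the kernel is asked for a number of pages in the range (2^(i-1), 2^i], the request is handled as a request for a block of 2^i pages. If the free list for that block size is empty, the allocator looks in the lists of larger blocks. When the block it finds has pages to spare, the buddy system puts the surplus on the free lists that match the sizes of the leftover pieces.
For example, to serve a request for a 128-page block, the allocator first checks the 128-page list for a free block. If there is none, it checks the 256-page list; if that has a free block, the 256-page block is split in two, one half is used, and the other half goes on the 128-page list. If the 256-page list is empty as well, it checks the 512-page list; if that has a free block, the block is split into pieces of 128, 128, and 256 pages, one 128-page piece is used, and the other two pieces go on the matching lists.
A page allocation starts at alloc_pages, which calls alloc_pages_current, which in turn calls __alloc_pages_nodemask, the core function of the buddy allocator: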

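/* The ALLOC_WMARK bits are used as an index to zone->watermark */
#define ALLOC_WMARK_MIN  WMARK_MIN // check against the min watermark
#define ALLOC_WMARK_LOW  WMARK_LOW // check against the low watermark
#define ALLOC_WMARK_HIGH WMARK_HIGH // check against the high watermark
#define ALLOC_NO_WATERMARKS 0x04   // do not check watermarks at all
#define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1) // mask that extracts the watermark index
#ifdef CONFIG_MMU
#define ALLOC_OOM  0x08 // allocation is allowed while out of memory (OOM)
#else
#define ALLOC_OOM  ALLOC_NO_WATERMARKS // allocation is allowed while out of memory (OOM)
#endif
#define ALLOC_HARDER  0x10 // try harder to allocate
#define ALLOC_HIGH   0x20 // caller is high priority
#define ALLOC_CPUSET  0x40 // check whether cpuset lets the process allocate from a given memory node
#define ALLOC_CMA   0x80 // allow allocating from the CMA (contiguous memory allocator) migrate type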

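The ALLOC_* values above are the allocator's internal allocation flags (alloc_flags), which state what an allocation is permitted to do. They are derived from the caller's GFP mask and are not passed to alloc_pages directly. alloc_pages itself takes the GFP mask (gfp_mask) as its first parameter and the order of the allocation as its second.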
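
Because the three watermark selectors occupy the low bits, masking alloc_flags with ALLOC_WMARK_MASK yields an index into the zone's watermark array no matter which other flags are set. Below is a small sketch of that indexing; the enum is assumed to match the kernel's watermark indices, and wmark_for is a hypothetical helper.

```c
#include <assert.h>

/* Watermark indices, assumed to mirror the kernel's WMARK_* enum. */
enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

#define ALLOC_WMARK_MIN      WMARK_MIN
#define ALLOC_WMARK_LOW      WMARK_LOW
#define ALLOC_WMARK_HIGH     WMARK_HIGH
#define ALLOC_NO_WATERMARKS  0x04
#define ALLOC_WMARK_MASK     (ALLOC_NO_WATERMARKS - 1)
#define ALLOC_HARDER         0x10
#define ALLOC_CPUSET         0x40

/* Hypothetical helper: pick the watermark an allocation has to respect. */
static unsigned long wmark_for(const unsigned long watermark[NR_WMARK],
                               unsigned int alloc_flags)
{
    unsigned int idx = alloc_flags & ALLOC_WMARK_MASK;

    /* 0x03 is not a valid selector: only MIN, LOW and HIGH exist. */
    assert(idx < NR_WMARK);
    return watermark[idx];
}
```

With the DMA32 watermarks from the earlier listing (1497, 1871, 2245), the flags ALLOC_WMARK_LOW | ALLOC_CPUSET select 1871, and ALLOC_WMARK_MIN | ALLOC_HARDER select 1497.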

static inline struct page *
alloc_pages(gfp_t gfp_mask, unsigned int order)
{
 return alloc_pages_current(gfp_mask, order);
}

struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
 struct mempolicy *pol = &default_policy;
 struct page *page;

 if (!in_interrupt() && !(gfp & __GFP_THISNODE))
  pol = get_task_policy(current);

 if (pol->mode == MPOL_INTERLEAVE)
  page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
 else
  page = __alloc_pages_nodemask(gfp, order,
    policy_node(gfp, pol, numa_node_id()),
    policy_nodemask(gfp, pol));

 return page;
}

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
       nodemask_t *nodemask)
{
 ...
 /* First allocation attempt */ // fast-path allocation
 page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 if (likely(page))
  goto out;
 ...
 // the fast path failed, so fall back to the slow-path allocation function
 page = __alloc_pages_slowpath(alloc_mask, order, &ac);

out:
 if (memcg_kmem_enabled() && (gfp_mask & __GFP_ACCOUNT) && page &&
     unlikely(memcg_kmem_charge(page, gfp_mask, order) != 0)) {
  __free_pages(page, order);
  page = NULL;
 }

 trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);

 return page;
}

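The core function __alloc_pages_nodemask has two main parts: first it runs the fast-path function get_page_from_freelist, and if that fails it runs the slow-path function __alloc_pages_slowpath. Let's look at the fast-path function get_page_from_freelist first: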

static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
      const struct alloc_context *ac)
{
 struct zoneref *z = ac->preferred_zoneref;
 struct zone *zone;
 struct pglist_data *last_pgdat_dirty_limit = NULL;

 // scan every eligible zone in the fallback zone list: zone type <= the preferred zone type
 for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
        ac->nodemask) {
  struct page *page;
  unsigned long mark;

  if (cpusets_enabled() &&   // cpusets are enabled
   (alloc_flags & ALLOC_CPUSET) && // and ALLOC_CPUSET is set
   !__cpuset_zone_allowed(zone, gfp_mask)) // and the cpuset does not allow allocating from this zone
    continue;       // then skip this zone and move on to the next one
  
  if (ac->spread_dirty_pages) { // the write flag is set: the page is being allocated to be written to
   // check the node's dirty page count; a node over its dirty limit is skipped
   if (last_pgdat_dirty_limit == zone->zone_pgdat)
    continue;

   if (!node_dirty_ok(zone->zone_pgdat)) {
    last_pgdat_dirty_limit = zone->zone_pgdat;
    continue;
   }
  }

  mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK]; // the watermark this allocation must respect
  // would (free pages - requested pages) drop below the watermark?
  if (!zone_watermark_fast(zone, order, mark,
           ac_classzone_idx(ac), alloc_flags)) {
   int ret;

   /* Checked here to keep the fast path fast */
   BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
   // no watermark requirement: use this zone right away
   if (alloc_flags & ALLOC_NO_WATERMARKS)
    goto try_this_zone;

   // skip the zone if node reclaim is disabled, or if this node is farther from the preferred node than the reclaim distance
   if (node_reclaim_mode == 0 ||
       !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
    continue;

   // reclaim pages on this node that are not mapped into any process address space, then recheck the watermark
   ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
   switch (ret) {
   case NODE_RECLAIM_NOSCAN:
    /* did not scan */
    continue;
   case NODE_RECLAIM_FULL:
    /* scanned but unreclaimable */
    continue;
   default:
    /* did we reclaim enough */
    if (zone_watermark_ok(zone, order, mark,
      ac_classzone_idx(ac), alloc_flags))
     goto try_this_zone;

    continue;
   }
  }

try_this_zone: // the checks above passed, start allocating
  // allocate pages from the current zone
  page = rmqueue(ac->preferred_zoneref->zone, zone, order,
    gfp_mask, alloc_flags, ac->migratetype);
  if (page) {
   // allocation succeeded, initialize the pages
   prep_new_page(page, order, gfp_mask, alloc_flags);

   /*
    * If this is a high-order atomic allocation then check
    * if the pageblock should be reserved for the future
    */
   // for a high-order ALLOC_HARDER allocation, check whether the pageblock should be reserved for future use
   if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
    reserve_highatomic_pageblock(page, zone, order);

   return page;
  } else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
   /* Try again if zone has deferred pages */
   // allocation failed: if the zone still has deferred pages, grow the zone and retry
   if (static_branch_unlikely(&deferred_pages)) {
    if (_deferred_grow_zone(zone, order))
     goto try_this_zone;
   }
#endif
  }
 }

 return NULL;
}

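Every zone has free lists for each block size, maintained by the buddy system, as described in the section on how the buddy allocator works. That makes the call to rmqueue easy to follow: find the list of the right size and take a block off it. The call chain continues as rmqueue -> __rmqueue -> __rmqueue_smallest, and this is where the buddy system's logic shows clearly.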


static inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
            int migratetype)
{
  unsigned int current_order;
  struct free_area *area;
  struct page *page;


  /* Find a page of the appropriate size in the preferred list */
  for (current_order = order; current_order < MAX_ORDER; ++current_order) {
    area = &(zone->free_area[current_order]);
    page = list_first_entry_or_null(&area->free_list[migratetype],
              struct page, lru);
    if (!page)
      continue;
    list_del(&page->lru);
    rmv_page_order(page);
    area->nr_free--;
    expand(zone, page, order, current_order, area, migratetype);
    set_pcppage_migratetype(page, migratetype);
    return page;
  }


  return NULL;
}

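Starting from the requested order (the exponent), the loop looks in the zone's free_area for a free block of 2^order pages. If the list for the current order has a first entry, the block has been found; if the list is empty, the loop moves on to the list of the next larger order. Once a block is found, it is taken off its list, and any surplus has to be put on the lists of smaller blocks. That is the job of expand. Inside expand, area-- steps to the previous entry of the buddy system's table, whose blocks are half the size of the current entry's; size is shifted right by one bit, which halves it; list_add puts the spare half on the list; and nr_free++ increments the count.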
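
Since the body of expand is not listed here, the following user-space sketch shows the split it performs. The struct toy_free_area and the function toy_expand are simplified stand-ins: instead of linking pages into lists, the sketch only records the first page frame number of the piece most recently returned to each order.

```c
#include <assert.h>

#define TOY_MAX_ORDER 11

/* Toy free-area entry: a counter plus the PFN last put on this order's list. */
struct toy_free_area {
    unsigned long nr_free;
    unsigned long last_pfn;
};

/*
 * A block of order `high` starting at `pfn` was taken off its list, but only
 * order `low` is needed. Walk down one order at a time, returning the upper
 * half to the free lists and keeping the lower half for the caller.
 */
static void toy_expand(struct toy_free_area *areas, unsigned long pfn,
                       unsigned int low, unsigned int high)
{
    unsigned long size = 1UL << high;

    assert(low <= high && high < TOY_MAX_ORDER);
    while (high > low) {
        high--;                             /* previous, smaller free_area */
        size >>= 1;                         /* half the block size */
        areas[high].last_pfn = pfn + size;  /* upper half goes back */
        areas[high].nr_free++;
    }
}
```

For the 128-page request served from a 512-page block at PFN 0 (low = 7, high = 9), the 256 pages starting at PFN 256 return to the order-8 list, the 128 pages starting at PFN 128 return to the order-7 list, and the caller keeps PFN 0 to 127.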
Next, the slow-path allocation function __alloc_pages_slowpath:

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
      struct alloc_context *ac)
{
 bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
 struct page *page = NULL;
 unsigned int alloc_flags;
 unsigned long did_some_progress;
 enum compact_priority compact_priority;
 enum compact_result compact_result;
 int compaction_retries;
 int no_progress_loops;
 unsigned int cpuset_mems_cookie;
 int reserve_flags;

 /*
  * We also sanity check to catch abuse of atomic reserves being used by
  * callers that are not in atomic context.
  */
 if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
    (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
  gfp_mask &= ~__GFP_ATOMIC;

retry_cpuset:
 compaction_retries = 0;
 no_progress_loops = 0;
 compact_priority = DEF_COMPACT_PRIORITY;
 // take a snapshot so we can check later whether the set of memory nodes the cpuset allows for this process has changed
 cpuset_mems_cookie = read_mems_allowed_begin();

 /*
  * The fast path uses conservative alloc_flags to succeed only until
  * kswapd needs to be woken up, and to avoid the cost of setting up
  * alloc_flags precisely. So we do that now.
  */
 // convert the GFP flags into the allocator's internal allocation flags
 alloc_flags = gfp_to_alloc_flags(gfp_mask);

 /*
  * We need to recalculate the starting point for the zonelist iterator
  * because we might have used different nodemask in the fast path, or
  * there was a cpuset modification and we are retrying - otherwise we
  * could end up iterating over non-eligible zones endlessly.
  */
 // recompute the preferred zone: the fast path may have used a different nodemask, and this avoids iterating over ineligible zones again
 ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
     ac->high_zoneidx, ac->nodemask);
 if (!ac->preferred_zoneref->zone)
  goto nopage;
 
 // asynchronous reclaim: wake the kswapd kernel threads to reclaim pages
 if (gfp_mask & __GFP_KSWAPD_RECLAIM)
  wake_all_kswapds(order, gfp_mask, ac);

 /*
  * The adjusted alloc_flags might result in immediate success, so try
  * that first
  */
 // the adjusted alloc_flags may succeed immediately, so try that first
 page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 if (page)
  goto got_pg;

 /*
  * For costly allocations, try direct compaction first, as it's likely
  * that we have enough base pages and don't need to reclaim. For non-
  * movable high-order allocations, do that as well, as compaction will
  * try prevent permanent fragmentation by migrating from blocks of the
  * same migratetype.
  * Don't try this for allocations that are allowed to ignore
  * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
  */
 // for costly orders, or non-movable allocations with order > 0, that are not allowed to ignore watermarks
 if (can_direct_reclaim &&
   (costly_order ||
      (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
   && !gfp_pfmemalloc_allowed(gfp_mask)) {
  // try direct compaction first, then allocate
  page = __alloc_pages_direct_compact(gfp_mask, order,
      alloc_flags, ac,
      INIT_COMPACT_PRIORITY,
      &compact_result);
  if (page)
   goto got_pg;

  /*
   * Checks for costly allocations with __GFP_NORETRY, which
   * includes THP page fault allocations
   */
  if (costly_order && (gfp_mask & __GFP_NORETRY)) {
   /*
    * If compaction is deferred for high-order allocations,
    * it is because sync compaction recently failed. If
    * this is the case and the caller requested a THP
    * allocation, we do not want to heavily disrupt the
    * system, so we fail the allocation instead of entering
    * direct reclaim.
    */
   if (compact_result == COMPACT_DEFERRED)
    goto nopage;

   /*
    * Looks like reclaim/compaction is worth trying, but
    * sync compaction could be very expensive, so keep
    * using async compaction.
    */
   // synchronous compaction is very expensive, so keep using asynchronous compaction
   compact_priority = INIT_COMPACT_PRIORITY;
  }
 }

retry:
 /* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 // wake kswapd again in case the reclaim thread went to sleep unexpectedly
 if (gfp_mask & __GFP_KSWAPD_RECLAIM)
  wake_all_kswapds(order, gfp_mask, ac);

 // if the caller is entitled to the emergency reserves, ignore the watermarks
 reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
 if (reserve_flags)
  alloc_flags = reserve_flags;

 /*
  * Reset the nodemask and zonelist iterators if memory policies can be
  * ignored. These allocations are high priority and system rather than
  * user oriented.
  */
 // if memory policies can be ignored, reset the nodemask and the zonelist iterator
 if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
  ac->nodemask = NULL;
  ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
     ac->high_zoneidx, ac->nodemask);
 }

 /* Attempt with potentially adjusted zonelist and alloc_flags */
 // try again with the possibly adjusted fallback zone list and allocation flags
 page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 if (page)
  goto got_pg;

 /* Caller is not willing to reclaim, we can't balance anything */
 // if direct reclaim is not allowed, the allocation fails
 if (!can_direct_reclaim)
  goto nopage;

 /* Avoid recursion of direct reclaim */
 if (current->flags & PF_MEMALLOC)
  goto nopage;

 /* Try direct reclaim and then allocating */
 // direct reclaim, then try to allocate
 page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
       &did_some_progress);
 if (page)
  goto got_pg;

 /* Try direct compaction and then allocating */
 // direct compaction, then try to allocate
 page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
     compact_priority, &compact_result);
 if (page)
  goto got_pg;

 /* Do not loop if specifically requested */
 // give up if the caller asked for no retries
 if (gfp_mask & __GFP_NORETRY)
  goto nopage;

 /*
  * Do not retry costly high order allocations unless they are
  * __GFP_RETRY_MAYFAIL
  */
 // do not retry costly high-order allocations unless they are __GFP_RETRY_MAYFAIL
 if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
  goto nopage;
 
 // decide whether to retry page reclaim
 if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
     did_some_progress > 0, &no_progress_loops))
  goto retry;

 /*
  * It doesn't make any sense to retry for the compaction if the order-0
  * reclaim is not able to make any progress because the current
  * implementation of the compaction depends on the sufficient amount
  * of free memory (see __compaction_suitable)
  */
 // if reclaim made progress, decide whether compaction should be retried (relevant for order > 0)
 if (did_some_progress > 0 &&
   should_compact_retry(ac, order, alloc_flags,
    compact_result, &compact_priority,
    &compaction_retries))
  goto retry;


 /* Deal with possible cpuset update races before we start OOM killing */
 // if the cpuset's allowed memory nodes changed in the meantime, start over from retry_cpuset
 if (check_retry_cpuset(cpuset_mems_cookie, ac))
  goto retry_cpuset;

 /* Reclaim has failed us, start killing things */
 // let the OOM killer pick a process to kill
 page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 if (page)
  goto got_pg;

 /* Avoid allocations with no watermarks from looping endlessly */
 // if the current process is the OOM victim and already ignores watermarks, give up
 if (tsk_is_oom_victim(current) &&
     (alloc_flags == ALLOC_OOM ||
      (gfp_mask & __GFP_NOMEMALLOC)))
  goto nopage;

 /* Retry as long as the OOM killer is making progress */
 // retry as long as the OOM killer is making progress
 if (did_some_progress) {
  no_progress_loops = 0;
  goto retry;
 }

nopage:
 /* Deal with possible cpuset update races before we fail */
 if (check_retry_cpuset(cpuset_mems_cookie, ac))
  goto retry_cpuset;

 /*
  * Make sure that __GFP_NOFAIL request doesn't leak out and make sure
  * we always retry
  */
 if (gfp_mask & __GFP_NOFAIL) {
  /*
   * All existing users of the __GFP_NOFAIL are blockable, so warn
   * of any new users that actually require GFP_NOWAIT
   */
  if (WARN_ON_ONCE(!can_direct_reclaim))
   goto fail;

  /*
   * PF_MEMALLOC request from this context is rather bizarre
   * because we cannot reclaim anything and only can loop waiting
   * for somebody to do a work for us
   */
  WARN_ON_ONCE(current->flags & PF_MEMALLOC);

  /*
   * non failing costly orders are a hard requirement which we
   * are not prepared for much so let's warn about these users
   * so that we can identify them and convert them to something
   * else.
   */
  WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);

  /*
   * Help non-failing allocations by giving them access to memory
   * reserves but do not use ALLOC_NO_WATERMARKS because this
   * could deplete whole memory reserves which would just make
   * the situation worse
   */
  // give them access to the memory reserves (ALLOC_HARDER) without dropping the watermarks entirely
  page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
  if (page)
   goto got_pg;

  cond_resched();
  goto retry;
 }
fail:
 warn_alloc(gfp_mask, ac->nodemask,
   "page allocation failure: order:%u", order);
got_pg:
 return page;
}