Yesterday several databases on one server started logging errors in the alert log:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Status 105 is ENOBUFS ("No buffer space available"): the OS could not allocate the memory needed by the sendmsg() call, which means the server was running short of (contiguous) free memory. The first thing I checked was, of course, the memory and the huge page usage:
# [ oracle@oraserver1:/home/oracle [10:45:46] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ free
              total        used        free      shared  buff/cache   available
Mem:      528076056   398142940     3236764   119855448   126696352     5646964
Swap:      16760828    11615324     5145504

# [ oracle@oraserver1:/home/oracle [10:46:47] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ cat /proc/meminfo | grep Huge
HugePages_Total:   180000
HugePages_Free:     86029
HugePages_Rsvd:     11507
HugePages_Surp:         0
Hugepagesize:        2048 kB
The available memory (last column of the free output) was indeed quite low, but there was still plenty of room in the huge pages (86k pages free out of 180k).
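One thing worth keeping in mind when reading the free output: the pages reserved for the huge page pool are pinned and never show up as available, whether the instances use them or not. A quick back-of-the-envelope check of the pool footprint, using the values above:

$ awk '/HugePages_Total/ {t=$2} /Hugepagesize/ {s=$2} END {printf "huge page pool: ~%.0f GiB\n", t*s/1024/1024}' /proc/meminfo
huge page pool: ~352 GiB

So roughly 352 GiB of the ~500 GiB of RAM were locked in the pool, which explains why only a few GB were reported as available even though 86k huge pages were sitting unused.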
The usage by Oracle instances:
# [ oracle@oraserver1:/home/oracle [10:45:39] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ sh mem.sh
DB12 : 54081544
DB22 : 37478820
DB32 : 67970828
DB42 : 14846552
DB52 : 16326380
DB62 : 15122048
DB82 : 56900472
DB92 : 14401080
DBA2 : 12622736
DBB2 : 14379916
DBC2 : 46078336
DBD2 : 46137728
DB72 : 37351336
total : 433697776
You can get the code of mem.sh in this post.
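The real script is in that post; as a rough idea only, a minimal sketch producing a similar per-SID summary could sum the RSS (in kB) of every process whose command line contains the instance name. This is an illustrative assumption, not the original mem.sh, and RSS double-counts memory shared between processes, so treat the figures as approximations:

#!/bin/bash
# Hypothetical mem.sh-like sketch: for every running instance (found via its pmon
# process), sum the RSS in kB of all processes mentioning the SID in their command line.
TOTAL=0
for SID in $(ps -eo cmd= | awk -F_ '/^ora_pmon_/ {print $3}') ; do
  MEM=$(ps -eo rss=,cmd= | awk -v sid="$SID" 'index($0, sid) {sum+=$1} END {print sum+0}')
  echo "$SID : $MEM"
  TOTAL=$((TOTAL + MEM))
done
echo "total : $TOTAL"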
Regarding pure shared memory usage, the situation was what I was expecting:
$ ipcs -m | awk 'BEGIN{a=0} {a+=$5} END{print a}'
369394520064
Almost 370 GB (344 GiB) of shared memory in use, much more than what was actually allocated in the huge pages.
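A rough cross-check against the huge page counters seen above (again a back-of-the-envelope sketch, using the numbers already shown):

$ awk '/HugePages_Total/ {t=$2} /HugePages_Free/ {f=$2} /HugePages_Rsvd/ {r=$2} /Hugepagesize/ {s=$2} END {printf "huge pages in use or reserved: ~%.0f GiB\n", (t-f+r)*s/1024/1024}' /proc/meminfo
huge pages in use or reserved: ~206 GiB

With about 344 GiB of shared memory segments but only ~206 GiB of them backed (or committed to be backed) by huge pages, well over 100 GiB of SGA had to live in regular 4k pages: a strong hint that some instances were not using huge pages at all.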
I then compared the situation with the other node of the cluster: it had more memory allocated by the databases (because it carries more load), more huge page usage, and less 4k page consumption overall.
$ sh mem.sh
DB12 : 78678000
DB22 : 14220000
DB32 : 14287528
DB42 : 12369352
DB52 : 14868596
DB62 : 14633984
DB82 : 54316104
DB92 : 86148332
DBA2 : 61473288
DBB2 : 68678788
DBC2 : 9831288
DBD2 : 64759352
DB72 : 68114604
total : 562379216

$ free
              total        used        free      shared  buff/cache   available
Mem:      528076056   402288800    17100464     5818032   108686792   114351784
Swap:      16760828       47360    16713468

$ cat /proc/meminfo | grep Huge
AnonHugePages:     10240 kB
HugePages_Total:   176654
HugePages_Free:     15557
HugePages_Rsvd:     15557
HugePages_Surp:         0
Hugepagesize:       2048 kB
So I started wondering whether all the DBs were properly allocating their SGA in huge pages or not.
This Red Hat page was quite useful for putting together a quick snippet that checks the huge page allocation per process:
# [ oracle@oraserver1:/home/oracle [10:55:27] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ cat /proc/707/numa_maps | grep -i hug
60000000 default file=/SYSV00000000\040(deleted) huge dirty=1 mapmax=57 N0=1 kernelpagesize_kB=2048
70000000 default file=/SYSV00000000\040(deleted) huge dirty=1525 mapmax=57 N0=743 N1=782 kernelpagesize_kB=2048
c60000000 interleave:0-1 file=/SYSV0b46df00\040(deleted) huge dirty=1 mapmax=57 N0=1 kernelpagesize_kB=2048

# [ oracle@oraserver1:/home/oracle [10:56:39] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ function pshugepage () {
> HUGEPAGECOUNT=0
> for num in `grep 'huge.*dirty=' /proc/$@/numa_maps | awk '{print $5}' | sed 's/dirty=//'` ; do
> HUGEPAGECOUNT=$((HUGEPAGECOUNT+num))
> done
> echo process $@ using $HUGEPAGECOUNT huge pages
> }

# [ oracle@oraserver1:/home/oracle [10:57:09] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ pshugepage 707
process 707 using 1527 huge pages

# [ oracle@oraserver1:/home/oracle [10:57:11] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ for pid in `ps -eaf | grep [p]mon | awk '{print $2}'` ; do pshugepage $pid ; done
process 707 using 1527 huge pages
process 3685 using 2409 huge pages
process 16092 using 3056 huge pages
process 55718 using 0 huge pages
process 58490 using 0 huge pages
process 70583 using 0 huge pages
process 94479 using 1135 huge pages
process 98216 using 0 huge pages
process 98755 using 0 huge pages
process 100245 using 0 huge pages
process 100265 using 0 huge pages
process 100270 using 0 huge pages
process 101681 using 0 huge pages
process 179079 using 1699 huge pages
process 189585 using 14566 huge pages
It was then easy to spot the databases that were not using huge pages at all:
# [ oracle@oraserver1:/home/oracle [10:58:26] [19.3.0.0.0 [GRID] SID=GRID] 0 ] #
$ ps -eaf | grep [p]mon
oracle      707      1  0 Sep30 ?        00:23:55 ora_pmon_DB12
oracle     3685      1  0 Nov01 ?        00:09:17 ora_pmon_DB22
oracle    16092      1  0 Oct15 ?        00:04:15 ora_pmon_DB32
oracle    55718      1  0 Aug12 ?        00:08:25 asm_pmon_+ASM2
oracle    58490      1  0 Aug12 ?        00:08:24 apx_pmon_+APX2
oracle    70583      1  0 Aug12 ?        00:57:55 ora_pmon_DB42
oracle    94479      1  0 Oct02 ?        00:32:03 ora_pmon_DB52
oracle    98216      1  0 Aug12 ?        00:58:36 ora_pmon_DB62
oracle    98755      1  0 Aug12 ?        00:59:27 ora_pmon_DB82
oracle   100245      1  0 Aug12 ?        00:56:52 ora_pmon_DB92
oracle   100265      1  0 Aug12 ?        00:51:54 ora_pmon_DBA2
oracle   100270      1  0 Aug12 ?        00:54:57 ora_pmon_DBB2
oracle   101681      1  0 Aug12 ?        00:56:55 ora_pmon_DBC2
oracle   179079      1  0 Sep10 ?        00:35:17 ora_pmon_DBD2
oracle   189585      1  0 Nov01 ?        00:09:34 ora_pmon_DB72
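To avoid matching PIDs to instance names by hand, the two outputs can be joined. A small hypothetical helper, which assumes the pshugepage function defined above is still available in the current shell:

$ ps -eo pid=,cmd= | awk '/[p]mon/ {print $1, $NF}' | while read pid name ; do
>   echo "$name : $(pshugepage $pid | awk '{print $4}') huge pages"
> done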
Indeed, after stopping them, the huge page usage did not change:
# [ oracle@oraserver1:/home/oracle [11:01:52] [11.2.0.4.0 [DBMS EE] SID=DB62] 1 ] #
$ srvctl stop instance -d DB6_SITE1 -i DB62

# [ oracle@oraserver1:/home/oracle [11:02:24] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl stop instance -d DB4_SITE1 -i DB42

# [ oracle@oraserver1:/home/oracle [11:03:29] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl stop instance -d DB8_SITE1 -i DB82

# [ oracle@oraserver1:/home/oracle [11:06:36] [11.2.0.4.0 [DBMS EE] SID=DB62] 130 ] #
$ srvctl stop instance -d DB9_SITE1 -i DB92

# [ oracle@oraserver1:/home/oracle [11:07:16] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl stop instance -d DBA_SITE1 -i DBA2

# [ oracle@oraserver1:/home/oracle [11:07:56] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl stop instance -d DBB_SITE1 -i DBB2

# [ oracle@oraserver1:/home/oracle [11:08:42] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl stop instance -d DBC_SITE1 -i DBC2

# [ oracle@oraserver1:/home/oracle [11:09:16] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ cat /proc/meminfo | grep Huge
HugePages_Total:   180000
HugePages_Free:     86029
HugePages_Rsvd:     11507
HugePages_Surp:         0
Hugepagesize:        2048 kB
But after starting them again I could see new huge pages being reserved and allocated:
# [ oracle@oraserver1:/home/oracle [11:10:35] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DB6_SITE1 -i DB62

# [ oracle@oraserver1:/home/oracle [11:12:14] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DB4_SITE1 -i DB42

# [ oracle@oraserver1:/home/oracle [11:12:54] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DB8_SITE1 -i DB82

# [ oracle@oraserver1:/home/oracle [11:13:41] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DB9_SITE1 -i DB92

# [ oracle@oraserver1:/home/oracle [11:14:43] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DBA_SITE1 -i DBA2

# [ oracle@oraserver1:/home/oracle [11:15:25] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DBB_SITE1 -i DBB2

# [ oracle@oraserver1:/home/oracle [11:15:54] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ srvctl start instance -d DBC_SITE1 -i DBC2

# [ oracle@oraserver1:/home/oracle [11:17:49] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ cat /proc/meminfo | grep Huge
HugePages_Total:   180000
HugePages_Free:     72820
HugePages_Rsvd:     68961
HugePages_Surp:         0
Hugepagesize:        2048 kB

# [ oracle@oraserver1:/home/oracle [11:17:54] [11.2.0.4.0 [DBMS EE] SID=DB62] 0 ] #
$ free
              total        used        free      shared  buff/cache   available
Mem:      528076056   392011828   123587116     5371848    12477112   126250868
Swap:      16760828      587308    16173520
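Running the same back-of-the-envelope check as before, the counters now account for the whole shared memory footprint (HugePages_Rsvd counts pages that the freshly started SGAs have committed to but not yet touched):

$ awk '/HugePages_Total/ {t=$2} /HugePages_Free/ {f=$2} /HugePages_Rsvd/ {r=$2} /Hugepagesize/ {s=$2} END {printf "huge pages in use or reserved: ~%.0f GiB\n", (t-f+r)*s/1024/1024}' /proc/meminfo
huge pages in use or reserved: ~344 GiB

That is in line with the ~344 GiB of shared memory segments measured earlier with ipcs, so this time the whole SGA footprint ended up in huge pages.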
The reason was that the server had initially been booted without huge pages configured; the huge pages were set only after a few instances had already started, so those instances had silently allocated their SGA in regular 4k pages.
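For reference, and as the comments below also point out, the usual way to prevent this is to make the huge page configuration persistent and to force the instances to refuse to start when the pool is too small. A minimal sketch; the value 180000 simply mirrors what this server already had configured, so size it to the sum of the SGAs plus some headroom:

# as root: persist the huge page pool across reboots (see also vm.nr_overcommit_hugepages in the comments)
$ echo "vm.nr_hugepages = 180000" >> /etc/sysctl.conf
$ sysctl -p

# for each database (then restart its instances): fail the startup if the SGA
# cannot fit entirely in huge pages, instead of silently falling back to 4k pages
$ sqlplus -s / as sysdba <<'EOF'
ALTER SYSTEM SET use_large_pages='ONLY' SCOPE=SPFILE SID='*';
EOF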
HTH
—
Ludovico
Set vm.nr_overcommit_hugepages in /etc/sysctl.conf and USE_LARGE_PAGES = ONLY
Thanks Zhwsh,
I did not know about the nr_overcommit_hugepages kernel parameter. It is a nice complement to USE_LARGE_PAGES = ONLY.
My post was more intended to explain “how to react in case of wrong configuration” rather than “how to configure huge pages correctly”, but your comment is very valuable 🙂
Great post Ludovico. Anyway, 'USE_LARGE_PAGES = ONLY' is a good way to avoid such a situation: startup will fail if a sufficient amount of huge pages is not available.
Thanks for the comment Gabor. You are right, USE_LARGE_PAGES = ONLY would avoid the issue 🙂
Thanks! 🙂
Thanks Ludo, one more time it was fun reading your blog on my bus ride to the office. Good work, keep blogging your issues.