Mailing List Archive

[PATCH 1/6] x86: Introduce x86_decode_lite()
In order to relocate all IP-relative fields in an alternative replacement
block, we need to decode the instructions enough to obtain their length and
any relative fields.

Full x86_decode() is far too heavyweight, so introduce a minimal form which
can make several simplifying assumptions.

This logic is sufficient to decode all alternative blocks that exist in Xen
right now.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
---
xen/arch/x86/x86_emulate/Makefile | 1 +
xen/arch/x86/x86_emulate/decode-lite.c | 245 +++++++++++++++++++++++++
xen/arch/x86/x86_emulate/private.h | 2 +
xen/arch/x86/x86_emulate/x86_emulate.h | 17 ++
4 files changed, 265 insertions(+)
create mode 100644 xen/arch/x86/x86_emulate/decode-lite.c

diff --git a/xen/arch/x86/x86_emulate/Makefile b/xen/arch/x86/x86_emulate/Makefile
index 2e20d65d788a..f06731913d67 100644
--- a/xen/arch/x86/x86_emulate/Makefile
+++ b/xen/arch/x86/x86_emulate/Makefile
@@ -3,6 +3,7 @@ obj-y += 0fae.o
obj-y += 0fc7.o
obj-y += blk.o
obj-y += decode.o
+obj-y += decode-lite.o
obj-$(CONFIG_HVM) += fpu.o
obj-y += util.o
obj-y += util-xen.o
diff --git a/xen/arch/x86/x86_emulate/decode-lite.c b/xen/arch/x86/x86_emulate/decode-lite.c
new file mode 100644
index 000000000000..050b0d8cf4a8
--- /dev/null
+++ b/xen/arch/x86/x86_emulate/decode-lite.c
@@ -0,0 +1,245 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#include "private.h"
+
+#define Imm8 (1 << 0)
+#define Imm (1 << 1)
+#define Branch (1 << 5) /* ... that we care about */
+/* ModRM (1 << 6) */
+#define Known (1 << 7)
+
+#define ALU_OPS \
+ (Known|ModRM), \
+ (Known|ModRM), \
+ (Known|ModRM), \
+ (Known|ModRM), \
+ (Known|Imm8), \
+ (Known|Imm)
+
+static const uint8_t init_or_livepatch_const onebyte[256] = {
+ [0x00] = ALU_OPS, /* ADD */ [0x08] = ALU_OPS, /* OR */
+ [0x10] = ALU_OPS, /* ADC */ [0x18] = ALU_OPS, /* SBB */
+ [0x20] = ALU_OPS, /* AND */ [0x28] = ALU_OPS, /* SUB */
+ [0x30] = ALU_OPS, /* XOR */ [0x38] = ALU_OPS, /* CMP */
+
+ [0x50 ... 0x5f] = (Known), /* PUSH/POP %reg */
+
+ [0x70 ... 0x7f] = (Known|Branch|Imm8), /* Jcc disp8 */
+ [0x80] = (Known|ModRM|Imm8),
+ [0x81] = (Known|ModRM|Imm),
+
+ [0x83] = (Known|ModRM|Imm8),
+ [0x84 ... 0x8e] = (Known|ModRM), /* TEST/XCHG/MOV/MOV-SREG/LEA r/rm */
+
+ [0xb0 ... 0xb7] = (Known|Imm8), /* MOV $imm8, %reg */
+ [0xb8 ... 0xbf] = (Known|Imm), /* MOV $imm32, %reg */
+
+ [0xcc] = (Known), /* INT3 */
+ [0xcd] = (Known|Imm8), /* INT $imm8 */
+
+ [0xe8 ... 0xe9] = (Known|Branch|Imm), /* CALL/JMP disp32 */
+ [0xeb] = (Known|Branch|Imm8), /* JMP disp8 */
+
+ [0xf1] = (Known), /* ICEBP */
+ [0xf4] = (Known), /* HLT */
+ [0xf5] = (Known), /* CMC */
+ [0xf6 ... 0xf7] = (Known|ModRM), /* Grp3 */
+ [0xf8 ... 0xfd] = (Known), /* CLC ... STD */
+ [0xfe ... 0xff] = (Known|ModRM), /* Grp4 */
+};
+static const uint8_t init_or_livepatch_const twobyte[256] = {
+ [0x00 ... 0x01] = (Known|ModRM), /* Grp6/Grp7 */
+
+ [0x18 ... 0x1f] = (Known|ModRM), /* Grp16 (Hint Nop) */
+
+ [0x20 ... 0x23] = (Known|ModRM), /* MOV CR/DR */
+
+ [0x30 ... 0x34] = (Known), /* WRMSR ... RDPMC */
+ [0x40 ... 0x4f] = (Known|ModRM), /* CMOVcc */
+
+ [0x80 ... 0x8f] = (Known|Branch|Imm), /* Jcc disp32 */
+ [0x90 ... 0x9f] = (Known|ModRM), /* SETcc */
+
+ [0xab] = (Known|ModRM), /* BTS */
+ [0xac] = (Known|ModRM|Imm8), /* SHRD $imm8 */
+ [0xad ... 0xaf] = (Known|ModRM), /* SHRD %cl / Grp15 / IMUL */
+
+ [0xb8 ... 0xb9] = (Known|ModRM), /* POPCNT/Grp10 (UD1) */
+ [0xba] = (Known|ModRM|Imm8), /* Grp8 */
+ [0xbb ... 0xbf] = (Known|ModRM), /* BSR/BSF/BSR/MOVSX */
+};
+
+/*
+ * Bare minimum x86 instruction decoder to parse the alternative replacement
+ * instructions and locate the IP-relative references that may need updating.
+ *
+ * These are:
+ * - disp8/32 from near branches
+ * - RIP-relative memory references
+ *
+ * The following simplifications are used:
+ * - All code is 64bit, and the instruction stream is safe to read.
+ * - The 67 prefix is not implemented, so the address size is only 64bit.
+ *
+ * Inputs:
+ * @ip The position to start decoding from.
+ * @end End of the replacement block. Exceeding this is considered an error.
+ *
+ * Returns: x86_decode_lite_t
+ * - On failure, length of -1.
+ * - On success, length > 0 and REL_TYPE_*. For REL_TYPE != NONE, rel points
+ * at the relative field in the instruction stream.
+ */
+x86_decode_lite_t init_or_livepatch x86_decode_lite(void *ip, void *end)
+{
+ void *start = ip, *rel = NULL;
+ unsigned int opc, type = REL_TYPE_NONE;
+ uint8_t b, d, osize = 4;
+
+#define OPC_TWOBYTE (1 << 8)
+
+ /* Mutates IP, uses END. */
+#define FETCH(ty) \
+ ({ \
+ ty _val; \
+ \
+ if ( (ip + sizeof(ty)) > end ) \
+ goto overrun; \
+ _val = *(ty *)ip; \
+ ip += sizeof(ty); \
+ _val; \
+ })
+
+ for ( ;; ) /* Prefixes */
+ {
+ switch ( b = FETCH(uint8_t) )
+ {
+ case 0x26: /* ES override */
+ case 0x2e: /* CS override */
+ case 0x36: /* DS override */
+ case 0x3e: /* SS override */
+ case 0x64: /* FS override */
+ case 0x65: /* GS override */
+ case 0xf0: /* Lock */
+ case 0xf2: /* REPNE */
+ case 0xf3: /* REP */
+ break;
+
+ case 0x66: /* Operand size override */
+ osize = 2;
+ break;
+
+ /* case 0x67: Address size override, not implemented */
+
+ case 0x40 ... 0x4f: /* REX */
+ continue;
+
+ default:
+ goto prefixes_done;
+ }
+ }
+ prefixes_done:
+
+ /* Fetch the main opcode byte(s) */
+ if ( b == 0x0f )
+ {
+ b = FETCH(uint8_t);
+ opc = OPC_TWOBYTE | b;
+
+ d = twobyte[b];
+ }
+ else
+ {
+ opc = b;
+ d = onebyte[b];
+ }
+
+ if ( unlikely(!(d & Known)) )
+ goto unknown;
+
+ if ( d & ModRM )
+ {
+ uint8_t modrm = FETCH(uint8_t);
+ uint8_t mod = modrm >> 6;
+ uint8_t reg = (modrm >> 3) & 7;
+ uint8_t rm = modrm & 7;
+
+ /* ModRM/SIB decode */
+ if ( mod == 0 && rm == 5 ) /* RIP relative */
+ {
+ rel = ip;
+ type = REL_TYPE_d32;
+ FETCH(uint32_t);
+ }
+ else if ( mod != 3 && rm == 4 ) /* SIB */
+ FETCH(uint8_t);
+
+ if ( mod == 1 ) /* disp8 */
+ FETCH(uint8_t);
+ else if ( mod == 2 ) /* disp32 */
+ FETCH(uint32_t);
+
+ /* ModRM based decode adjustments */
+ switch ( opc )
+ {
+ case 0xf6: /* Grp3 TEST(s) have extra Imm8 */
+ if ( reg == 0 || reg == 1 )
+ d |= Imm8;
+ break;
+ case 0xf7: /* Grp3 TEST(s) have extra Imm */
+ if ( reg == 0 || reg == 1 )
+ d |= Imm;
+ break;
+ }
+ }
+
+ if ( d & Branch )
+ {
+ /*
+ * Don't tolerate 66-prefixed call/jmp in alternatives. Some are
+ * genuinely decoded differently between Intel and AMD CPUs.
+ */
+ if ( osize != 4 )
+ goto bad_osize;
+
+ rel = ip;
+ type = (d & Imm8) ? REL_TYPE_d8 : REL_TYPE_d32;
+ }
+
+ if ( d & (Imm|Imm8) )
+ {
+ if ( d & Imm8 )
+ osize = 1;
+
+ switch ( osize )
+ {
+ case 1: FETCH(uint8_t); break;
+ case 2: FETCH(uint16_t); break;
+ case 4: FETCH(uint32_t); break;
+ default: goto bad_osize;
+ }
+ }
+
+ return (x86_decode_lite_t){ ip - start, type, rel };
+
+ bad_osize:
+ printk(XENLOG_ERR "%s() Bad osize %u in %*ph\n",
+ __func__, osize,
+ (int)(unsigned long)(end - start), start);
+ return (x86_decode_lite_t){ -1, REL_TYPE_NONE, NULL };
+
+ unknown:
+ printk(XENLOG_ERR "%s() Unknown opcode in %*ph <%02x> %*ph\n",
+ __func__,
+ (int)(unsigned long)(ip - 1 - start), start, b,
+ (int)(unsigned long)(end - ip), ip);
+ return (x86_decode_lite_t){ -1, REL_TYPE_NONE, NULL };
+
+ overrun:
+ printk(XENLOG_ERR "%s() Decode overrun, got %*ph\n",
+ __func__,
+ (int)(unsigned long)(end - start), start);
+ return (x86_decode_lite_t){ -1, REL_TYPE_NONE, NULL };
+
+#undef FETCH
+}
diff --git a/xen/arch/x86/x86_emulate/private.h b/xen/arch/x86/x86_emulate/private.h
index 0fa26ba00aa5..792b04889ce6 100644
--- a/xen/arch/x86/x86_emulate/private.h
+++ b/xen/arch/x86/x86_emulate/private.h
@@ -9,7 +9,9 @@
#ifdef __XEN__

# include <xen/bug.h>
+# include <xen/init.h>
# include <xen/kernel.h>
+# include <xen/livepatch.h>
# include <asm/endbr.h>
# include <asm/msr-index.h>
# include <asm/x86-vendors.h>
diff --git a/xen/arch/x86/x86_emulate/x86_emulate.h b/xen/arch/x86/x86_emulate/x86_emulate.h
index d92be69d84d9..b7cbcd8b70ba 100644
--- a/xen/arch/x86/x86_emulate/x86_emulate.h
+++ b/xen/arch/x86/x86_emulate/x86_emulate.h
@@ -842,4 +842,21 @@ static inline void x86_emul_reset_event(struct x86_emulate_ctxt *ctxt)
ctxt->event = (struct x86_event){};
}

+/*
+ * x86_decode_lite(). Very minimal decoder for managing alternatives.
+ *
+ * @len is -1 on error, or positive on success. If the instruction has a
+ * relative field, REL_TYPE_* gives the size, and @rel points at the field.
+ */
+typedef struct {
+ int8_t len;
+ uint8_t rel_type;
+#define REL_TYPE_NONE 0
+#define REL_TYPE_d8 1
+#define REL_TYPE_d32 2
+ void *rel;
+} x86_decode_lite_t;
+
+x86_decode_lite_t x86_decode_lite(void *ip, void *end);
+
#endif /* __X86_EMULATE_H__ */
--
2.30.2
Re: [PATCH 1/6] x86: Introduce x86_decode_lite() [ In reply to ]
On 22.04.2024 20:14, Andrew Cooper wrote:
> In order to relocate all IP-relative fields in an alternative replacement
> block, we need to decode the instructions enough to obtain their length and
> any relative fields.
>
> Full x86_decode() is far too heavyweight, so introduce a minimal form which
> can make several simplifying assumptions.
>
> This logic is sufficient to decode all alternative blocks that exist in Xen
> right now.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
> --- a/xen/arch/x86/x86_emulate/Makefile
> +++ b/xen/arch/x86/x86_emulate/Makefile
> @@ -3,6 +3,7 @@ obj-y += 0fae.o
> obj-y += 0fc7.o
> obj-y += blk.o
> obj-y += decode.o
> +obj-y += decode-lite.o

When LIVEPATCH=n, this could do with conversion to *.init.o, like is
done in the parent directory for alternative.c as well.

> --- /dev/null
> +++ b/xen/arch/x86/x86_emulate/decode-lite.c
> @@ -0,0 +1,245 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +
> +#include "private.h"
> +
> +#define Imm8 (1 << 0)
> +#define Imm (1 << 1)
> +#define Branch (1 << 5) /* ... that we care about */
> +/* ModRM (1 << 6) */
> +#define Known (1 << 7)
> +
> +#define ALU_OPS \
> + (Known|ModRM), \
> + (Known|ModRM), \
> + (Known|ModRM), \
> + (Known|ModRM), \
> + (Known|Imm8), \
> + (Known|Imm)
> +
> +static const uint8_t init_or_livepatch_const onebyte[256] = {
> + [0x00] = ALU_OPS, /* ADD */ [0x08] = ALU_OPS, /* OR */
> + [0x10] = ALU_OPS, /* ADC */ [0x18] = ALU_OPS, /* SBB */
> + [0x20] = ALU_OPS, /* AND */ [0x28] = ALU_OPS, /* SUB */
> + [0x30] = ALU_OPS, /* XOR */ [0x38] = ALU_OPS, /* CMP */
> +
> + [0x50 ... 0x5f] = (Known), /* PUSH/POP %reg */
> +
> + [0x70 ... 0x7f] = (Known|Branch|Imm8), /* Jcc disp8 */
> + [0x80] = (Known|ModRM|Imm8),
> + [0x81] = (Known|ModRM|Imm),
> +
> + [0x83] = (Known|ModRM|Imm8),
> + [0x84 ... 0x8e] = (Known|ModRM), /* TEST/XCHG/MOV/MOV-SREG/LEA r/rm */
> +
> + [0xb0 ... 0xb7] = (Known|Imm8), /* MOV $imm8, %reg */

I'm surprised you get away without at least NOP also marked as known.
Imo the whole 0x90 row should be marked so.

Similarly I'm not convinced that leaving the 0xA0 row unpopulated is a
good idea. It's at best a trap set up for somebody to fall into rather
sooner than later.

> + [0xb8 ... 0xbf] = (Known|Imm), /* MOV $imm32, %reg */
> +
> + [0xcc] = (Known), /* INT3 */
> + [0xcd] = (Known|Imm8), /* INT $imm8 */

Like above, what about in particular any of the shifts/rotates and the
MOV that's in the 0xC0 row?

While the last sentence in the description is likely meant to cover
that, I think the description wants to go further as to any pretty
common but omitted insns. Already "x86: re-work memset()" and "x86: re-
work memcpy()" (v2 pending for, soon, 3 years) would make it necessary
to touch this table, thus increasing complexity of those changes to an
area they shouldn't be concerned about at all.

> + [0xe8 ... 0xe9] = (Known|Branch|Imm), /* CALL/JMP disp32 */
> + [0xeb] = (Known|Branch|Imm8), /* JMP disp8 */

0xe0 ... 0xe7 and 0xec ... 0xef would imo also better be covered, as
they easily can be (much like you also cover e.g. CMC despite it
appearing pretty unlikely that we use that insn anywhere).

> + [0xf1] = (Known), /* ICEBP */
> + [0xf4] = (Known), /* HLT */
> + [0xf5] = (Known), /* CMC */
> + [0xf6 ... 0xf7] = (Known|ModRM), /* Grp3 */
> + [0xf8 ... 0xfd] = (Known), /* CLC ... STD */
> + [0xfe ... 0xff] = (Known|ModRM), /* Grp4 */
> +};
> +static const uint8_t init_or_livepatch_const twobyte[256] = {
> + [0x00 ... 0x01] = (Known|ModRM), /* Grp6/Grp7 */

LAR / LSL? CLTS? WBINVD? UD2?

> + [0x18 ... 0x1f] = (Known|ModRM), /* Grp16 (Hint Nop) */
> +
> + [0x20 ... 0x23] = (Known|ModRM), /* MOV CR/DR */
> +
> + [0x30 ... 0x34] = (Known), /* WRMSR ... RDPMC */

0x34 is SYSENTER, isn't it?

> + [0x40 ... 0x4f] = (Known|ModRM), /* CMOVcc */
> +
> + [0x80 ... 0x8f] = (Known|Branch|Imm), /* Jcc disp32 */

What about things like VMREAD/VMWRITE?

> + [0x90 ... 0x9f] = (Known|ModRM), /* SETcc */

PUSH/POP fs/gs? CPUID?

> + [0xab] = (Known|ModRM), /* BTS */

BTS (and BTC below) but not BT and BTR?

> + [0xac] = (Known|ModRM|Imm8), /* SHRD $imm8 */
> + [0xad ... 0xaf] = (Known|ModRM), /* SHRD %cl / Grp15 / IMUL */

SHRD but not SHLD?

CMPXCHG?

> + [0xb8 ... 0xb9] = (Known|ModRM), /* POPCNT/Grp10 (UD1) */
> + [0xba] = (Known|ModRM|Imm8), /* Grp8 */
> + [0xbb ... 0xbf] = (Known|ModRM), /* BSR/BSF/BSR/MOVSX */

Nit (comment only): 0xbb is BTC.

MOVSX but not MOVZX and also not MOVSXD (in the 1-byte table)?

> +};

XADD, MOVNTI, and the whole 0xc7-based group?

> +/*
> + * Bare minimum x86 instruction decoder to parse the alternative replacement
> + * instructions and locate the IP-relative references that may need updating.
> + *
> + * These are:
> + * - disp8/32 from near branches
> + * - RIP-relative memory references
> + *
> + * The following simplifications are used:
> + * - All code is 64bit, and the instruction stream is safe to read.

This, in particular, is imo important to state already ahead of the tables
above, to clarify why e.g. row 0x40 is unpopulated.

> + * - The 67 prefix is not implemented, so the address size is only 64bit.
> + *
> + * Inputs:
> + * @ip The position to start decoding from.
> + * @end End of the replacement block. Exceeding this is considered an error.
> + *
> + * Returns: x86_decode_lite_t
> + * - On failure, length of -1.
> + * - On success, length > 0 and REL_TYPE_*. For REL_TYPE != NONE, rel points
> + * at the relative field in the instruction stream.
> + */
> +x86_decode_lite_t init_or_livepatch x86_decode_lite(void *ip, void *end)
> +{
> + void *start = ip, *rel = NULL;
> + unsigned int opc, type = REL_TYPE_NONE;
> + uint8_t b, d, osize = 4;
> +
> +#define OPC_TWOBYTE (1 << 8)
> +
> + /* Mutates IP, uses END. */
> +#define FETCH(ty) \
> + ({ \
> + ty _val; \
> + \
> + if ( (ip + sizeof(ty)) > end ) \
> + goto overrun; \
> + _val = *(ty *)ip; \
> + ip += sizeof(ty); \
> + _val; \
> + })
> +
> + for ( ;; ) /* Prefixes */
> + {
> + switch ( b = FETCH(uint8_t) )
> + {
> + case 0x26: /* ES override */
> + case 0x2e: /* CS override */
> + case 0x36: /* DS override */
> + case 0x3e: /* SS override */
> + case 0x64: /* FS override */
> + case 0x65: /* GS override */
> + case 0xf0: /* Lock */
> + case 0xf2: /* REPNE */
> + case 0xf3: /* REP */
> + break;
> +
> + case 0x66: /* Operand size override */
> + osize = 2;
> + break;
> +
> + /* case 0x67: Address size override, not implemented */
> +
> + case 0x40 ... 0x4f: /* REX */
> + continue;

Imo at least a comment is needed as to osize here: We don't use 0x66
followed by REX.W, I suppose, when 0x66 determines operand size. It
may also be an opcode extension, though, in which case osize set to
2 is (latently) wrong. "Latently" because all you need osize for is
to determine Imm's length.

However, what I again think need covering right away are opcodes
0xb8 ... 0xbc in combination with REX.W (osize needing to be 8 there).

Finally - why "continue" here, but "break" further up? Both (right
now) have exactly the same effect.

> + default:
> + goto prefixes_done;
> + }
> + }
> + prefixes_done:
> +
> + /* Fetch the main opcode byte(s) */
> + if ( b == 0x0f )
> + {
> + b = FETCH(uint8_t);
> + opc = OPC_TWOBYTE | b;
> +
> + d = twobyte[b];
> + }
> + else
> + {
> + opc = b;
> + d = onebyte[b];
> + }

IOW GPR insns in 0f38 and 0f3a spaces are left out, too. That's perhaps
less of an issue than some of the other omissions (and would be more
involved to cover when considering that some of them are VEX-encoded),
but still not ideal.

> + if ( unlikely(!(d & Known)) )
> + goto unknown;
> +
> + if ( d & ModRM )
> + {
> + uint8_t modrm = FETCH(uint8_t);
> + uint8_t mod = modrm >> 6;
> + uint8_t reg = (modrm >> 3) & 7;
> + uint8_t rm = modrm & 7;
> +
> + /* ModRM/SIB decode */
> + if ( mod == 0 && rm == 5 ) /* RIP relative */
> + {
> + rel = ip;
> + type = REL_TYPE_d32;
> + FETCH(uint32_t);
> + }
> + else if ( mod != 3 && rm == 4 ) /* SIB */
> + FETCH(uint8_t);
> +
> + if ( mod == 1 ) /* disp8 */
> + FETCH(uint8_t);
> + else if ( mod == 2 ) /* disp32 */
> + FETCH(uint32_t);
> +
> + /* ModRM based decode adjustments */
> + switch ( opc )
> + {
> + case 0xf6: /* Grp3 TEST(s) have extra Imm8 */
> + if ( reg == 0 || reg == 1 )
> + d |= Imm8;
> + break;
> + case 0xf7: /* Grp3 TEST(s) have extra Imm */
> + if ( reg == 0 || reg == 1 )
> + d |= Imm;
> + break;
> + }
> + }
> +
> + if ( d & Branch )
> + {
> + /*
> + * Don't tolerate 66-prefixed call/jmp in alternatives. Some are
> + * genuinely decoded differently between Intel and AMD CPUs.
> + */
> + if ( osize != 4 )
> + goto bad_osize;
> +
> + rel = ip;
> + type = (d & Imm8) ? REL_TYPE_d8 : REL_TYPE_d32;

Perhaps ASSERT(!rel) and/or ASSERT(!type) ahead of these?

> + }
> +
> + if ( d & (Imm|Imm8) )
> + {
> + if ( d & Imm8 )
> + osize = 1;
> +
> + switch ( osize )
> + {
> + case 1: FETCH(uint8_t); break;
> + case 2: FETCH(uint16_t); break;
> + case 4: FETCH(uint32_t); break;
> + default: goto bad_osize;
> + }
> + }
> +
> + return (x86_decode_lite_t){ ip - start, type, rel };
> +
> + bad_osize:
> + printk(XENLOG_ERR "%s() Bad osize %u in %*ph\n",
> + __func__, osize,
> + (int)(unsigned long)(end - start), start);
> + return (x86_decode_lite_t){ -1, REL_TYPE_NONE, NULL };

Maybe limit opcode quoting to ip - start here?

> --- a/xen/arch/x86/x86_emulate/private.h
> +++ b/xen/arch/x86/x86_emulate/private.h
> @@ -9,7 +9,9 @@
> #ifdef __XEN__
>
> # include <xen/bug.h>
> +# include <xen/init.h>
> # include <xen/kernel.h>
> +# include <xen/livepatch.h>

Are both of these really needed here, rather than just in decode-lite.c?

> --- a/xen/arch/x86/x86_emulate/x86_emulate.h
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.h
> @@ -842,4 +842,21 @@ static inline void x86_emul_reset_event(struct x86_emulate_ctxt *ctxt)
> ctxt->event = (struct x86_event){};
> }
>
> +/*
> + * x86_decode_lite(). Very minimal decoder for managing alternatives.
> + *
> + * @len is -1 on error, or positive on success. If the instruction has a
> + * relative field, REL_TYPE_* gives the size, and @rel points at the field.
> + */
> +typedef struct {
> + int8_t len;

Can't 0 be "error", thus permitting the field to be uint8_t?

> + uint8_t rel_type;
> +#define REL_TYPE_NONE 0
> +#define REL_TYPE_d8 1
> +#define REL_TYPE_d32 2

Why not call the field rel_size and truly pass back the size?

> + void *rel;

While I understand the goal of omitting const here and ...

> +} x86_decode_lite_t;
> +
> +x86_decode_lite_t x86_decode_lite(void *ip, void *end);

... here, I still find this fragile / misleading (the function itself, after
all, only ever fetches from memory). Even with the goal in mind, surely at
least "end" can be pointer-to-const?

The (struct) return type would also be easier for the compiler to deal
with if it didn't have a pointer field (and hence needs to be 128-bit). How
about returning an offset relative to "start"? That would then allow proper
constifying of both function parameters as well.

Jan
Re: [PATCH 1/6] x86: Introduce x86_decode_lite() [ In reply to ]
On 23/04/2024 10:17 am, Jan Beulich wrote:
> On 22.04.2024 20:14, Andrew Cooper wrote:
>> --- /dev/null
>> +++ b/xen/arch/x86/x86_emulate/decode-lite.c
>> @@ -0,0 +1,245 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +
>> +#include "private.h"
>> +
>> +#define Imm8 (1 << 0)
>> +#define Imm (1 << 1)
>> +#define Branch (1 << 5) /* ... that we care about */
>> +/* ModRM (1 << 6) */
>> +#define Known (1 << 7)
>> +
>> +#define ALU_OPS \
>> + (Known|ModRM), \
>> + (Known|ModRM), \
>> + (Known|ModRM), \
>> + (Known|ModRM), \
>> + (Known|Imm8), \
>> + (Known|Imm)
>> +
>> +static const uint8_t init_or_livepatch_const onebyte[256] = {
>> + [0x00] = ALU_OPS, /* ADD */ [0x08] = ALU_OPS, /* OR */
>> + [0x10] = ALU_OPS, /* ADC */ [0x18] = ALU_OPS, /* SBB */
>> + [0x20] = ALU_OPS, /* AND */ [0x28] = ALU_OPS, /* SUB */
>> + [0x30] = ALU_OPS, /* XOR */ [0x38] = ALU_OPS, /* CMP */
>> +
>> + [0x50 ... 0x5f] = (Known), /* PUSH/POP %reg */
>> +
>> + [0x70 ... 0x7f] = (Known|Branch|Imm8), /* Jcc disp8 */
>> + [0x80] = (Known|ModRM|Imm8),
>> + [0x81] = (Known|ModRM|Imm),
>> +
>> + [0x83] = (Known|ModRM|Imm8),
>> + [0x84 ... 0x8e] = (Known|ModRM), /* TEST/XCHG/MOV/MOV-SREG/LEA r/rm */
>> +
>> + [0xb0 ... 0xb7] = (Known|Imm8), /* MOV $imm8, %reg */
> I'm surprised you get away without at least NOP also marked as known.
> Imo the whole 0x90 row should be marked so.
>
> Similarly I'm not convinced that leaving the 0xA0 row unpopulated is a
> good idea. It's at best a trap set up for somebody to fall into rather
> sooner than later.
>
>> + [0xb8 ... 0xbf] = (Known|Imm), /* MOV $imm32, %reg */
>> +
>> + [0xcc] = (Known), /* INT3 */
>> + [0xcd] = (Known|Imm8), /* INT $imm8 */
> Like above, what about in particular any of the shifts/rotates and the
> MOV that's in the 0xC0 row?
>
> While the last sentence in the description is likely meant to cover
> that, I think the description wants to go further as to any pretty
> common but omitted insns. Already "x86: re-work memset()" and "x86: re-
> work memcpy()" (v2 pending for, soon, 3 years) would make it necessary
> to touch this table, thus increasing complexity of those changes to an
> area they shouldn't be concerned about at all.
>
>> + [0xe8 ... 0xe9] = (Known|Branch|Imm), /* CALL/JMP disp32 */
>> + [0xeb] = (Known|Branch|Imm8), /* JMP disp8 */
> 0xe0 ... 0xe7 and 0xec ... 0xef would imo also better be covered, as
> they easily can be (much like you also cover e.g. CMC despite it
> appearing pretty unlikely that we use that insn anywhere).
>
>> + [0xf1] = (Known), /* ICEBP */
>> + [0xf4] = (Known), /* HLT */
>> + [0xf5] = (Known), /* CMC */
>> + [0xf6 ... 0xf7] = (Known|ModRM), /* Grp3 */
>> + [0xf8 ... 0xfd] = (Known), /* CLC ... STD */
>> + [0xfe ... 0xff] = (Known|ModRM), /* Grp4 */
>> +};
>> +static const uint8_t init_or_livepatch_const twobyte[256] = {
>> + [0x00 ... 0x01] = (Known|ModRM), /* Grp6/Grp7 */
> LAR / LSL? CLTS? WBINVD? UD2?
>
>> + [0x18 ... 0x1f] = (Known|ModRM), /* Grp16 (Hint Nop) */
>> +
>> + [0x20 ... 0x23] = (Known|ModRM), /* MOV CR/DR */
>> +
>> + [0x30 ... 0x34] = (Known), /* WRMSR ... RDPMC */
> 0x34 is SYSENTER, isn't it?
>
>> + [0x40 ... 0x4f] = (Known|ModRM), /* CMOVcc */
>> +
>> + [0x80 ... 0x8f] = (Known|Branch|Imm), /* Jcc disp32 */
> What about things like VMREAD/VMWRITE?
>
>> + [0x90 ... 0x9f] = (Known|ModRM), /* SETcc */
> PUSH/POP fs/gs? CPUID?
>
>> + [0xab] = (Known|ModRM), /* BTS */
> BTS (and BTC below) but not BT and BTR?
>
>> + [0xac] = (Known|ModRM|Imm8), /* SHRD $imm8 */
>> + [0xad ... 0xaf] = (Known|ModRM), /* SHRD %cl / Grp15 / IMUL */
> SHRD but not SHLD?
>
> CMPXCHG?
>
>> + [0xb8 ... 0xb9] = (Known|ModRM), /* POPCNT/Grp10 (UD1) */
>> + [0xba] = (Known|ModRM|Imm8), /* Grp8 */
>> + [0xbb ... 0xbf] = (Known|ModRM), /* BSR/BSF/BSR/MOVSX */
> Nit (comment only): 0xbb is BTC.
>
> MOVSX but not MOVZX and also not MOVSXD (in the 1-byte table)?
>
>> +};
> XADD, MOVNTI, and the whole 0xc7-based group?

As you may have guessed, I filled in the opcode table until I could
parse all replacements.

When, at the end of this, I didn't need the osize=8 movs, I took the
decoding complexity back out.

In an ideal world I would have finished the userspace harness too, but I
ran out of time.  Where it's just populating the table, that's fine. 
Where it complicates the decoding logic, that needs more thought.
>> +/*
>> + * Bare minimum x86 instruction decoder to parse the alternative replacement
>> + * instructions and locate the IP-relative references that may need updating.
>> + *
>> + * These are:
>> + * - disp8/32 from near branches
>> + * - RIP-relative memory references
>> + *
>> + * The following simplifications are used:
>> + * - All code is 64bit, and the instruction stream is safe to read.
> This, in particular, is imo important to state already ahead of the tables
> above, to clarify why e.g. row 0x40 is unpopulated.

I initially had that, but took it out.  I can put it back in.

>
>> + * - The 67 prefix is not implemented, so the address size is only 64bit.
>> + *
>> + * Inputs:
>> + * @ip The position to start decoding from.
>> + * @end End of the replacement block. Exceeding this is considered an error.
>> + *
>> + * Returns: x86_decode_lite_t
>> + * - On failure, length of -1.
>> + * - On success, length > 0 and REL_TYPE_*. For REL_TYPE != NONE, rel points
>> + * at the relative field in the instruction stream.
>> + */
>> +x86_decode_lite_t init_or_livepatch x86_decode_lite(void *ip, void *end)
>> +{
>> + void *start = ip, *rel = NULL;
>> + unsigned int opc, type = REL_TYPE_NONE;
>> + uint8_t b, d, osize = 4;
>> +
>> +#define OPC_TWOBYTE (1 << 8)
>> +
>> + /* Mutates IP, uses END. */
>> +#define FETCH(ty) \
>> + ({ \
>> + ty _val; \
>> + \
>> + if ( (ip + sizeof(ty)) > end ) \
>> + goto overrun; \
>> + _val = *(ty *)ip; \
>> + ip += sizeof(ty); \
>> + _val; \
>> + })
>> +
>> + for ( ;; ) /* Prefixes */
>> + {
>> + switch ( b = FETCH(uint8_t) )
>> + {
>> + case 0x26: /* ES override */
>> + case 0x2e: /* CS override */
>> + case 0x36: /* DS override */
>> + case 0x3e: /* SS override */
>> + case 0x64: /* FS override */
>> + case 0x65: /* GS override */
>> + case 0xf0: /* Lock */
>> + case 0xf2: /* REPNE */
>> + case 0xf3: /* REP */
>> + break;
>> +
>> + case 0x66: /* Operand size override */
>> + osize = 2;
>> + break;
>> +
>> + /* case 0x67: Address size override, not implemented */
>> +
>> + case 0x40 ... 0x4f: /* REX */
>> + continue;
> Imo at least a comment is needed as to osize here: We don't use 0x66
> followed by REX.W, I suppose, when 0x66 determines operand size. It
> may also be an opcode extension, though, in which case osize set to
> 2 is (latently) wrong. "Latently" because all you need osize for is
> to determine Imm's length.
>
> However, what I again think need covering right away are opcodes
> 0xb8 ... 0xbc in combination with REX.W (osize needing to be 8 there).
>
> Finally - why "continue" here, but "break" further up? Both (right
> now) have exactly the same effect.

They're not the same when ...

>
>> + default:
>> + goto prefixes_done;
>> + }


... this has "cancel the REX prefix" in it.

I started by decoding REX, only to find I didn't need it, so took it
back out.

>> + }
>> + prefixes_done:
>> +
>> + /* Fetch the main opcode byte(s) */
>> + if ( b == 0x0f )
>> + {
>> + b = FETCH(uint8_t);
>> + opc = OPC_TWOBYTE | b;
>> +
>> + d = twobyte[b];
>> + }
>> + else
>> + {
>> + opc = b;
>> + d = onebyte[b];
>> + }
> IOW GPR insns in 0f38 and 0f3a spaces are left out, too. That's perhaps
> less of an issue than some of the other omissions (and would be more
> involved to cover when considering that some of them are VEX-encoded),
> but still not ideal.

They can all be added if needed, but right now they're not.

This decoder only need to cover instructions likely to be used in
alternatives, and that pretty limits us to simple integer operations.

Any extra complexity here makes the function less and less "lite".
>> + }
>> +
>> + if ( d & (Imm|Imm8) )
>> + {
>> + if ( d & Imm8 )
>> + osize = 1;
>> +
>> + switch ( osize )
>> + {
>> + case 1: FETCH(uint8_t); break;
>> + case 2: FETCH(uint16_t); break;
>> + case 4: FETCH(uint32_t); break;
>> + default: goto bad_osize;
>> + }
>> + }
>> +
>> + return (x86_decode_lite_t){ ip - start, type, rel };
>> +
>> + bad_osize:
>> + printk(XENLOG_ERR "%s() Bad osize %u in %*ph\n",
>> + __func__, osize,
>> + (int)(unsigned long)(end - start), start);
>> + return (x86_decode_lite_t){ -1, REL_TYPE_NONE, NULL };
> Maybe limit opcode quoting to ip - start here?

In the case that we've taken the bad_osize path, we've not decoded the
full instruction.  The bytes beyond ip are useful for diagnostics.

>
>> --- a/xen/arch/x86/x86_emulate/private.h
>> +++ b/xen/arch/x86/x86_emulate/private.h
>> @@ -9,7 +9,9 @@
>> #ifdef __XEN__
>>
>> # include <xen/bug.h>
>> +# include <xen/init.h>
>> # include <xen/kernel.h>
>> +# include <xen/livepatch.h>
> Are both of these really needed here, rather than just in decode-lite.c?

Yes, for the userspace harness.

>
>> --- a/xen/arch/x86/x86_emulate/x86_emulate.h
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.h
>> @@ -842,4 +842,21 @@ static inline void x86_emul_reset_event(struct x86_emulate_ctxt *ctxt)
>> ctxt->event = (struct x86_event){};
>> }
>>
>> +/*
>> + * x86_decode_lite(). Very minimal decoder for managing alternatives.
>> + *
>> + * @len is -1 on error, or positive on success. If the instruction has a
>> + * relative field, REL_TYPE_* gives the size, and @rel points at the field.
>> + */
>> +typedef struct {
>> + int8_t len;
> Can't 0 be "error", thus permitting the field to be uint8_t?

Hmm, yes I think it can.  -1 is a leftover from an older scheme.

>
>> + uint8_t rel_type;
>> +#define REL_TYPE_NONE 0
>> +#define REL_TYPE_d8 1
>> +#define REL_TYPE_d32 2
> Why not call the field rel_size and truly pass back the size?

Ok.


>
>> + void *rel;
> While I understand the goal of omitting const here and ...
>
>> +} x86_decode_lite_t;
>> +
>> +x86_decode_lite_t x86_decode_lite(void *ip, void *end);
> ... here, I still find this fragile / misleading (the function itself, after
> all, only ever fetches from memory). Even with the goal in mind, surely at
> least "end" can be pointer-to-const?
>
> The (struct) return type would also be easier for the compiler to deal
> with if it didn't have a pointer field (and hence needs to be 128-bit). How
> about returning an offset relative to "start"? That would then allow proper
> constifying of both function parameters as well.

Quite the contrary.

I did initially pack it all into a single GPR, but both the written C
and code generation of this form is better.

This is what the code generation looks like:

<xdl>:
55                      push   %rbp
48 89 e5                mov    %rsp,%rbp
e8 77 fd ff ff          callq  ffff82d04033c580 <x86_decode_lite>
5d                      pop    %rbp
88 05 01 77 2f 00       mov    %al,0x2f7701(%rip)        # <xdl_len>
0f b6 c4                movzbl %ah,%eax
88 05 f7 76 2f 00       mov    %al,0x2f76f7(%rip)        # <xdl_type>
48 89 15 e8 76 2f 00    mov    %rdx,0x2f76e8(%rip)       # <xdl_rel>
c3                      retq   

and keeping rel as a full pointer simplifies both sides of the function.

(Again, I have no idea why GCC is setting up a frame pointer in this
case.  It's not needed given our preferred alignment setting.)

~Andrew
Re: [PATCH 1/6] x86: Introduce x86_decode_lite() [ In reply to ]
On 23.04.2024 12:27, Andrew Cooper wrote:
> On 23/04/2024 10:17 am, Jan Beulich wrote:
>> On 22.04.2024 20:14, Andrew Cooper wrote:
>>> --- /dev/null
>>> +++ b/xen/arch/x86/x86_emulate/decode-lite.c
>>> @@ -0,0 +1,245 @@
>>> +/* SPDX-License-Identifier: GPL-2.0-only */
>>> +
>>> +#include "private.h"
>>> +
>>> +#define Imm8 (1 << 0)
>>> +#define Imm (1 << 1)
>>> +#define Branch (1 << 5) /* ... that we care about */
>>> +/* ModRM (1 << 6) */
>>> +#define Known (1 << 7)
>>> +
>>> +#define ALU_OPS \
>>> + (Known|ModRM), \
>>> + (Known|ModRM), \
>>> + (Known|ModRM), \
>>> + (Known|ModRM), \
>>> + (Known|Imm8), \
>>> + (Known|Imm)
>>> +
>>> +static const uint8_t init_or_livepatch_const onebyte[256] = {
>>> + [0x00] = ALU_OPS, /* ADD */ [0x08] = ALU_OPS, /* OR */
>>> + [0x10] = ALU_OPS, /* ADC */ [0x18] = ALU_OPS, /* SBB */
>>> + [0x20] = ALU_OPS, /* AND */ [0x28] = ALU_OPS, /* SUB */
>>> + [0x30] = ALU_OPS, /* XOR */ [0x38] = ALU_OPS, /* CMP */
>>> +
>>> + [0x50 ... 0x5f] = (Known), /* PUSH/POP %reg */
>>> +
>>> + [0x70 ... 0x7f] = (Known|Branch|Imm8), /* Jcc disp8 */
>>> + [0x80] = (Known|ModRM|Imm8),
>>> + [0x81] = (Known|ModRM|Imm),
>>> +
>>> + [0x83] = (Known|ModRM|Imm8),
>>> + [0x84 ... 0x8e] = (Known|ModRM), /* TEST/XCHG/MOV/MOV-SREG/LEA r/rm */
>>> +
>>> + [0xb0 ... 0xb7] = (Known|Imm8), /* MOV $imm8, %reg */
>> I'm surprised you get away without at least NOP also marked as known.
>> Imo the whole 0x90 row should be marked so.
>>
>> Similarly I'm not convinced that leaving the 0xA0 row unpopulated is a
>> good idea. It's at best a trap set up for somebody to fall into rather
>> sooner than later.
>>
>>> + [0xb8 ... 0xbf] = (Known|Imm), /* MOV $imm32, %reg */
>>> +
>>> + [0xcc] = (Known), /* INT3 */
>>> + [0xcd] = (Known|Imm8), /* INT $imm8 */
>> Like above, what about in particular any of the shifts/rotates and the
>> MOV that's in the 0xC0 row?
>>
>> While the last sentence in the description is likely meant to cover
>> that, I think the description wants to go further as to any pretty
>> common but omitted insns. Already "x86: re-work memset()" and "x86: re-
>> work memcpy()" (v2 pending for, soon, 3 years) would make it necessary
>> to touch this table, thus increasing complexity of those changes to an
>> area they shouldn't be concerned about at all.
>>
>>> + [0xe8 ... 0xe9] = (Known|Branch|Imm), /* CALL/JMP disp32 */
>>> + [0xeb] = (Known|Branch|Imm8), /* JMP disp8 */
>> 0xe0 ... 0xe7 and 0xec ... 0xef would imo also better be covered, as
>> they easily can be (much like you also cover e.g. CMC despite it
>> appearing pretty unlikely that we use that insn anywhere).
>>
>>> + [0xf1] = (Known), /* ICEBP */
>>> + [0xf4] = (Known), /* HLT */
>>> + [0xf5] = (Known), /* CMC */
>>> + [0xf6 ... 0xf7] = (Known|ModRM), /* Grp3 */
>>> + [0xf8 ... 0xfd] = (Known), /* CLC ... STD */
>>> + [0xfe ... 0xff] = (Known|ModRM), /* Grp4 */
>>> +};
>>> +static const uint8_t init_or_livepatch_const twobyte[256] = {
>>> + [0x00 ... 0x01] = (Known|ModRM), /* Grp6/Grp7 */
>> LAR / LSL? CLTS? WBINVD? UD2?
>>
>>> + [0x18 ... 0x1f] = (Known|ModRM), /* Grp16 (Hint Nop) */
>>> +
>>> + [0x20 ... 0x23] = (Known|ModRM), /* MOV CR/DR */
>>> +
>>> + [0x30 ... 0x34] = (Known), /* WRMSR ... RDPMC */
>> 0x34 is SYSENTER, isn't it?
>>
>>> + [0x40 ... 0x4f] = (Known|ModRM), /* CMOVcc */
>>> +
>>> + [0x80 ... 0x8f] = (Known|Branch|Imm), /* Jcc disp32 */
>> What about things like VMREAD/VMWRITE?
>>
>>> + [0x90 ... 0x9f] = (Known|ModRM), /* SETcc */
>> PUSH/POP fs/gs? CPUID?
>>
>>> + [0xab] = (Known|ModRM), /* BTS */
>> BTS (and BTC below) but not BT and BTR?
>>
>>> + [0xac] = (Known|ModRM|Imm8), /* SHRD $imm8 */
>>> + [0xad ... 0xaf] = (Known|ModRM), /* SHRD %cl / Grp15 / IMUL */
>> SHRD but not SHLD?
>>
>> CMPXCHG?
>>
>>> + [0xb8 ... 0xb9] = (Known|ModRM), /* POPCNT/Grp10 (UD1) */
>>> + [0xba] = (Known|ModRM|Imm8), /* Grp8 */
>>> + [0xbb ... 0xbf] = (Known|ModRM), /* BSR/BSF/BSR/MOVSX */
>> Nit (comment only): 0xbb is BTC.
>>
>> MOVSX but not MOVZX and also not MOVSXD (in the 1-byte table)?
>>
>>> +};
>> XADD, MOVNTI, and the whole 0xc7-based group?
>
> As you may have guessed, I filled in the opcode table until I could
> parse all replacements.
>
> When, at the end of this, I didn't need the osize=8 movs, I took the
> decoding complexity back out.

While I can see that this requiring extra logic makes it desirable to
leave out, it'll easily be a surprise when, eventually, someone adds an
alternative using such. Please may I ask that for any "simple" integer
insn left out, that be clearly mentioned in or ahead of the tables?

>>> + * - The 67 prefix is not implemented, so the address size is only 64bit.
>>> + *
>>> + * Inputs:
>>> + * @ip The position to start decoding from.
>>> + * @end End of the replacement block. Exceeding this is considered an error.
>>> + *
>>> + * Returns: x86_decode_lite_t
>>> + * - On failure, length of -1.
>>> + * - On success, length > 0 and REL_TYPE_*. For REL_TYPE != NONE, rel points
>>> + * at the relative field in the instruction stream.
>>> + */
>>> +x86_decode_lite_t init_or_livepatch x86_decode_lite(void *ip, void *end)
>>> +{
>>> + void *start = ip, *rel = NULL;
>>> + unsigned int opc, type = REL_TYPE_NONE;
>>> + uint8_t b, d, osize = 4;
>>> +
>>> +#define OPC_TWOBYTE (1 << 8)
>>> +
>>> + /* Mutates IP, uses END. */
>>> +#define FETCH(ty) \
>>> + ({ \
>>> + ty _val; \
>>> + \
>>> + if ( (ip + sizeof(ty)) > end ) \
>>> + goto overrun; \
>>> + _val = *(ty *)ip; \
>>> + ip += sizeof(ty); \
>>> + _val; \
>>> + })
>>> +
>>> + for ( ;; ) /* Prefixes */
>>> + {
>>> + switch ( b = FETCH(uint8_t) )
>>> + {
>>> + case 0x26: /* ES override */
>>> + case 0x2e: /* CS override */
>>> + case 0x36: /* DS override */
>>> + case 0x3e: /* SS override */
>>> + case 0x64: /* FS override */
>>> + case 0x65: /* GS override */
>>> + case 0xf0: /* Lock */
>>> + case 0xf2: /* REPNE */
>>> + case 0xf3: /* REP */
>>> + break;
>>> +
>>> + case 0x66: /* Operand size override */
>>> + osize = 2;
>>> + break;
>>> +
>>> + /* case 0x67: Address size override, not implemented */
>>> +
>>> + case 0x40 ... 0x4f: /* REX */
>>> + continue;
>> Imo at least a comment is needed as to osize here: We don't use 0x66
>> followed by REX.W, I suppose, when 0x66 determines operand size. It
>> may also be an opcode extension, though, in which case osize set to
>> 2 is (latently) wrong. "Latently" because all you need osize for is
>> to determine Imm's length.
>>
>> However, what I again think need covering right away are opcodes
>> 0xb8 ... 0xbc in combination with REX.W (osize needing to be 8 there).
>>
>> Finally - why "continue" here, but "break" further up? Both (right
>> now) have exactly the same effect.
>
> They're not the same when ...
>
>>
>>> + default:
>>> + goto prefixes_done;
>>> + }
>
>
> ... this has "cancel the REX prefix" in it.

Of course.

> I started by decoding REX, only to find I didn't need it, so took it
> back out.

At which point imo they want to all be "break". Once the cancellation
needs adding, where necessary "break" can be switched to "continue".
(Interestingly REX2 is different from REX in this regard, and hence
wouldn't need such cancellation, if ever we end up patching APX insns.)

>>> + }
>>> + prefixes_done:
>>> +
>>> + /* Fetch the main opcode byte(s) */
>>> + if ( b == 0x0f )
>>> + {
>>> + b = FETCH(uint8_t);
>>> + opc = OPC_TWOBYTE | b;
>>> +
>>> + d = twobyte[b];
>>> + }
>>> + else
>>> + {
>>> + opc = b;
>>> + d = onebyte[b];
>>> + }
>> IOW GPR insns in 0f38 and 0f3a spaces are left out, too. That's perhaps
>> less of an issue than some of the other omissions (and would be more
>> involved to cover when considering that some of them are VEX-encoded),
>> but still not ideal.
>
> They can all be added if needed, but right now they're not.
>
> This decoder only needs to cover instructions likely to be used in
> alternatives, and that pretty much limits us to simple integer operations.
>
> Any extra complexity here makes the function less and less "lite".

I understand this. At the same time to me anything pending and anything
previously submitted but disliked for whatever reason wants at least
considering to cover. In that context please recall that once there was
BMI2 patching (that you didn't like) ...

>>> + }
>>> +
>>> + if ( d & (Imm|Imm8) )
>>> + {
>>> + if ( d & Imm8 )
>>> + osize = 1;
>>> +
>>> + switch ( osize )
>>> + {
>>> + case 1: FETCH(uint8_t); break;
>>> + case 2: FETCH(uint16_t); break;
>>> + case 4: FETCH(uint32_t); break;
>>> + default: goto bad_osize;
>>> + }
>>> + }
>>> +
>>> + return (x86_decode_lite_t){ ip - start, type, rel };
>>> +
>>> + bad_osize:
>>> + printk(XENLOG_ERR "%s() Bad osize %u in %*ph\n",
>>> + __func__, osize,
>>> + (int)(unsigned long)(end - start), start);
>>> + return (x86_decode_lite_t){ -1, REL_TYPE_NONE, NULL };
>> Maybe limit opcode quoting to ip - start here?
>
> In the case that we've taken the bad_osize path, we've not decoded the
> full instruction.  The bytes beyond ip are useful for diagnostics.

Hmm. The reason for the reported failure lies within the [start,ip)
range. Yet I agree that would not be a complete insn. Otoh what is
being patched may be a meaningfully long series of insns, which aren't
useful to all log. If you think including the immediate (the value of
which is of no interest) is relevant, may I then please ask that you
bound logging at the insn size limit of 15?

>>> --- a/xen/arch/x86/x86_emulate/private.h
>>> +++ b/xen/arch/x86/x86_emulate/private.h
>>> @@ -9,7 +9,9 @@
>>> #ifdef __XEN__
>>>
>>> # include <xen/bug.h>
>>> +# include <xen/init.h>
>>> # include <xen/kernel.h>
>>> +# include <xen/livepatch.h>
>> Are both of these really needed here, rather than just in decode-lite.c?
>
> Yes, for the userspace harness.

The user space harness shouldn't include any Xen headers. Patch context
(the #ifdef at the top) even shows that this is Xen-only code.

>>> + void *rel;
>> While I understand the goal of omitting const here and ...
>>
>>> +} x86_decode_lite_t;
>>> +
>>> +x86_decode_lite_t x86_decode_lite(void *ip, void *end);
>> ... here, I still find this fragile / misleading (the function itself, after
>> all, only ever fetches from memory). Even with the goal in mind, surely at
>> least "end" can be pointer-to-const?
>>
>> The (struct) return type would also be easier for the compiler to deal
>> with if it didn't have a pointer field (and hence needs to be 128-bit). How
>> about returning an offset relative to "start"? That would then allow proper
>> constifying of both function parameters as well.
>
> Quite the contrary.
>
> I did initially pack it all into a single GPR, but both the written C
> and code generation of this form is better.
>
> This is what the code generation looks like:
>
> <xdl>:
> 55                      push   %rbp
> 48 89 e5                mov    %rsp,%rbp
> e8 77 fd ff ff          callq  ffff82d04033c580 <x86_decode_lite>
> 5d                      pop    %rbp
> 88 05 01 77 2f 00       mov    %al,0x2f7701(%rip)        # <xdl_len>
> 0f b6 c4                movzbl %ah,%eax
> 88 05 f7 76 2f 00       mov    %al,0x2f76f7(%rip)        # <xdl_type>
> 48 89 15 e8 76 2f 00    mov    %rdx,0x2f76e8(%rip)       # <xdl_rel>
> c3                      retq   

Well, "easier" in my earlier reply wasn't referring to the complexity
of generated code. Instead I'm slightly wary of compiler issues with
the calling convention for 128-bit struct returns.

> and keeping rel as a full pointer simplifies both sides of the function.

Hmm, I can see that being the case. I wonder whether const-correctness
doesn't weigh higher, though. But I'm also not going to insist.

Jan