[RFC] block layer support for DMA IOMMU bypass mode
Background:
Quite a few servers on the market today include an IOMMU to patch up
bus-to-memory accessibility issues. However, a fair number come with
the caveat that actually using the IOMMU is expensive.
Some IOMMUs come with a "bypass" mode, where the IOMMU won't try to
translate the physical address coming from the device but will instead
place it directly on the memory bus. For some machines (ia-64, and
possibly x86_64) any address not programmed into the IOMMU for
translation is viewed as a bypass. For others (parisc SBA) you have to
assert specific address bus bits to get the bypass.
All IOMMUs supporting bypass mode allow it to be used selectively, so a
single DMA transfer may contain both bypassed and IOMMU-mapped SG
segments.
The Problem:
At the moment, the block layer assumes segments may be virtually
mergeable (i.e. two physically discontiguous pages may be treated as a
single SG entity for DMA because the IOMMU will patch up the
discontinuity) if an IOMMU is present in the system. This effectively
stymies using bypass mode, because segments may not be virtually merged
in a bypass operation.
The Solution:
Is to teach the block layer not to virtually merge segments if either
segment may be bypassed. To that end, the block layer has to know what
the physical dma mask is (not the bounce limit, which is different) and
it must also know the address bits that must be asserted in bypass
mode. For this purpose, I've introduced a new #define in asm/io.h:
BIO_VMERGE_BYPASS_MASK
Which is set either to the physical bits that have to be asserted, or
simply to an address (like 0x1) that will always pass the device's
dma_mask.
I've also introduced a new block layer callback
blk_queue_dma_mask(q, dma_mask)
whose job is to set the physical dma_mask of the queue (it defaults to
0xffffffff).
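Pulling the two knobs together, a user-space sketch of the bypass-eligibility decision (the 0x1 mask value and the struct layout are illustrative, following the patch below):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Platform-provided knob (asm/io.h): either the physical address bits
 * that must be asserted for a bypass (parisc SBA style), or a trivial
 * value like 0x1 that any real dma_mask covers (ia64 style).  Zero
 * means the platform has no bypass mode.  The 0x1 here is illustrative. */
#define BIO_VMERGE_BYPASS_MASK 0x1ULL

struct request_queue { u64 dma_mask; };

/* The proposed callback: record the device's physical DMA mask
 * (distinct from the bounce limit) on the queue. */
static void blk_queue_dma_mask(struct request_queue *q, u64 dma_mask)
{
	q->dma_mask = dma_mask;
}

/* A queue is bypass-eligible when the bypass bits fit inside its mask. */
static int bio_can_bypass(const struct request_queue *q)
{
	return BIO_VMERGE_BYPASS_MASK &&
	       (q->dma_mask & BIO_VMERGE_BYPASS_MASK) == BIO_VMERGE_BYPASS_MASK;
}
```

With a 32-bit dma_mask the queue is bypass-eligible here; a mask of zero never is.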
You can see how this works in the attached patch (block layer only; the
DMA engines of platforms wishing to take advantage of bypassing would
also have to be altered).
Comments?
James
===== drivers/block/DAC960.c 1.60 vs edited =====
--- 1.60/drivers/block/DAC960.c Fri Jun 6 01:37:51 2003
+++ edited/drivers/block/DAC960.c Tue Jul 1 10:51:21 2003
@@ -2475,6 +2475,8 @@
RequestQueue = &Controller->RequestQueue;
blk_init_queue(RequestQueue, DAC960_RequestFunction, &Controller->queue_lock);
blk_queue_bounce_limit(RequestQueue, Controller->BounceBufferLimit);
+ blk_queue_dma_mask(RequestQueue, Controller->BounceBufferLimit);
+
RequestQueue->queuedata = Controller;
blk_queue_max_hw_segments(RequestQueue,
Controller->DriverScatterGatherLimit);
===== drivers/block/cciss.c 1.82 vs edited =====
--- 1.82/drivers/block/cciss.c Thu Jun 5 08:17:28 2003
+++ edited/drivers/block/cciss.c Tue Jul 1 10:49:56 2003
@@ -2541,6 +2541,7 @@
spin_lock_init(&hba[i]->lock);
blk_init_queue(q, do_cciss_request, &hba[i]->lock);
blk_queue_bounce_limit(q, hba[i]->pdev->dma_mask);
+ blk_queue_dma_mask(q, hba[i]->pdev->dma_mask);

/* This is a hardware imposed limit. */
blk_queue_max_hw_segments(q, MAXSGENTRIES);
===== drivers/block/cpqarray.c 1.77 vs edited =====
--- 1.77/drivers/block/cpqarray.c Wed Jun 11 01:33:24 2003
+++ edited/drivers/block/cpqarray.c Tue Jul 1 10:51:52 2003
@@ -389,6 +389,7 @@
spin_lock_init(&hba[i]->lock);
blk_init_queue(q, do_ida_request, &hba[i]->lock);
blk_queue_bounce_limit(q, hba[i]->pci_dev->dma_mask);
+ blk_queue_dma_mask(q, hba[i]->pci_dev->dma_mask);

/* This is a hardware imposed limit. */
blk_queue_max_hw_segments(q, SG_MAX);
===== drivers/block/ll_rw_blk.c 1.174 vs edited =====
--- 1.174/drivers/block/ll_rw_blk.c Mon Jun 2 20:32:46 2003
+++ edited/drivers/block/ll_rw_blk.c Tue Jul 1 10:39:46 2003
@@ -213,12 +213,32 @@
* by default assume old behaviour and bounce for any highmem page
*/
blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
+ /*
+ * and assume a 32 bit dma mask
+ */
+ blk_queue_dma_mask(q, 0xffffffff);

init_waitqueue_head(&q->queue_wait);
INIT_LIST_HEAD(&q->plug_list);
}

/**
+ * blk_queue_dma_mask - set queue dma mask
+ * @q: the request queue for the device
+ * @dma_addr: bus address limit
+ *
+ * Description:
+ * This will set the device physical DMA mask. This is used by
+ * the bio layer to arrange the segments correctly for IOMMUs that
+ * can be programmed in bypass mode. Note: setting this does *not*
+ * change whether the device goes through an IOMMU or not
+ **/
+void blk_queue_dma_mask(request_queue_t *q, u64 dma_mask)
+{
+ q->dma_mask = dma_mask;
+}
+
+/**
* blk_queue_bounce_limit - set bounce buffer limit for queue
* @q: the request queue for the device
* @dma_addr: bus address limit
@@ -746,7 +766,7 @@
continue;
}
new_segment:
- if (!bvprv || !BIOVEC_VIRT_MERGEABLE(bvprv, bv))
+ if (!bvprv || !BIOVEC_VIRT_MERGEABLE(q, bvprv, bv))
nr_hw_segs++;

nr_phys_segs++;
@@ -787,7 +807,7 @@
if (!(q->queue_flags & (1 << QUEUE_FLAG_CLUSTER)))
return 0;

- if (!BIOVEC_VIRT_MERGEABLE(__BVEC_END(bio), __BVEC_START(nxt)))
+ if (!BIOVEC_VIRT_MERGEABLE(q, __BVEC_END(bio), __BVEC_START(nxt)))
return 0;
if (bio->bi_size + nxt->bi_size > q->max_segment_size)
return 0;
@@ -909,7 +929,7 @@
return 0;
}

- if (BIOVEC_VIRT_MERGEABLE(__BVEC_END(req->biotail), __BVEC_START(bio)))
+ if (BIOVEC_VIRT_MERGEABLE(q, __BVEC_END(req->biotail), __BVEC_START(bio)))
return ll_new_mergeable(q, req, bio);

return ll_new_hw_segment(q, req, bio);
@@ -924,7 +944,7 @@
return 0;
}

- if (BIOVEC_VIRT_MERGEABLE(__BVEC_END(bio), __BVEC_START(req->bio)))
+ if (BIOVEC_VIRT_MERGEABLE(q, __BVEC_END(bio), __BVEC_START(req->bio)))
return ll_new_mergeable(q, req, bio);

return ll_new_hw_segment(q, req, bio);
===== drivers/ide/ide-lib.c 1.8 vs edited =====
--- 1.8/drivers/ide/ide-lib.c Thu Mar 6 17:27:52 2003
+++ edited/drivers/ide/ide-lib.c Tue Jul 1 10:49:14 2003
@@ -406,6 +406,9 @@
addr = HWIF(drive)->pci_dev->dma_mask;
}

+ if((HWIF(drive)->pci_dev))
+ blk_queue_dma_mask(&drive->queue,
+ HWIF(drive)->pci_dev->dma_mask);
blk_queue_bounce_limit(&drive->queue, addr);
}

===== drivers/scsi/scsi_lib.c 1.99 vs edited =====
--- 1.99/drivers/scsi/scsi_lib.c Sun Jun 29 20:14:44 2003
+++ edited/drivers/scsi/scsi_lib.c Tue Jul 1 10:54:19 2003
@@ -1256,6 +1256,8 @@
blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
blk_queue_max_sectors(q, shost->max_sectors);
blk_queue_bounce_limit(q, scsi_calculate_bounce_limit(shost));
+ if(scsi_get_device(shost) && scsi_get_device(shost)->dma_mask)
+ blk_queue_dma_mask(q, *scsi_get_device(shost)->dma_mask);

if (!shost->use_clustering)
clear_bit(QUEUE_FLAG_CLUSTER, &q->queue_flags);
===== include/linux/bio.h 1.32 vs edited =====
--- 1.32/include/linux/bio.h Tue Jun 10 07:54:25 2003
+++ edited/include/linux/bio.h Tue Jul 1 11:20:26 2003
@@ -30,6 +30,11 @@
#define BIO_VMERGE_BOUNDARY 0
#endif

+/* Can the IOMMU (if any) work in bypass mode */
+#ifndef BIO_VMERGE_BYPASS_MASK
+#define BIO_VMERGE_BYPASS_MASK 0
+#endif
+
#define BIO_DEBUG

#ifdef BIO_DEBUG
@@ -156,6 +161,13 @@

#define __bio_kunmap_atomic(addr, kmtype) kunmap_atomic(addr, kmtype)

+/* can the bio vec be directly physically addressed by the device */
+#define __BVEC_PHYS_DIRECT_OK(q, vec) \
+ ((bvec_to_phys(vec) & (q)->dma_mask) == bvec_to_phys(vec))
+/* Is the queue dma_mask eligible to be bypassed */
+#define __BIO_CAN_BYPASS(q) \
+ ((BIO_VMERGE_BYPASS_MASK) && ((q)->dma_mask & (BIO_VMERGE_BYPASS_MASK)) == (BIO_VMERGE_BYPASS_MASK))
+
/*
* merge helpers etc
*/
@@ -164,8 +176,10 @@
#define __BVEC_START(bio) bio_iovec_idx((bio), 0)
#define BIOVEC_PHYS_MERGEABLE(vec1, vec2) \
((bvec_to_phys((vec1)) + (vec1)->bv_len) == bvec_to_phys((vec2)))
-#define BIOVEC_VIRT_MERGEABLE(vec1, vec2) \
- ((((bvec_to_phys((vec1)) + (vec1)->bv_len) | bvec_to_phys((vec2))) & (BIO_VMERGE_BOUNDARY - 1)) == 0)
+#define BIOVEC_VIRT_MERGEABLE(q, vec1, vec2) \
+ (((((bvec_to_phys((vec1)) + (vec1)->bv_len) | bvec_to_phys((vec2))) & (BIO_VMERGE_BOUNDARY - 1)) == 0) \
+ && !( __BIO_CAN_BYPASS(q) && (__BVEC_PHYS_DIRECT_OK(q, vec1) \
+ || __BVEC_PHYS_DIRECT_OK(q, vec2))))
#define __BIO_SEG_BOUNDARY(addr1, addr2, mask) \
(((addr1) | (mask)) == (((addr2) - 1) | (mask)))
#define BIOVEC_SEG_BOUNDARY(q, b1, b2) \
===== include/linux/blkdev.h 1.109 vs edited =====
--- 1.109/include/linux/blkdev.h Wed Jun 11 20:17:55 2003
+++ edited/include/linux/blkdev.h Tue Jul 1 10:39:18 2003
@@ -255,6 +255,12 @@
unsigned long bounce_pfn;
int bounce_gfp;

+ /*
+ * The physical dma_mask for the queue (used to make IOMMU
+ * bypass decisions)
+ */
+ u64 dma_mask;
+
struct list_head plug_list;

/*
@@ -458,6 +464,7 @@
extern void blk_cleanup_queue(request_queue_t *);
extern void blk_queue_make_request(request_queue_t *, make_request_fn *);
extern void blk_queue_bounce_limit(request_queue_t *, u64);
+extern void blk_queue_dma_mask(request_queue_t *, u64);
extern void blk_queue_max_sectors(request_queue_t *, unsigned short);
extern void blk_queue_max_phys_segments(request_queue_t *, unsigned short);
extern void blk_queue_max_hw_segments(request_queue_t *, unsigned short);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] block layer support for DMA IOMMU bypass mode
On 01 Jul 2003 11:46:12 -0500
James Bottomley <James.Bottomley@steeleye.com> wrote:
>
> Some IOMMUs come with a "bypass" mode, where the IOMMU won't try to
> translate the physical address coming from the device but will instead
> place it directly on the memory bus. For some machines (ia-64, and
> possibly x86_64) any address not programmed into the IOMMU for
That's the case on x86_64 yes.
> The Problem:
>
> At the moment, the block layer assumes segments may be virtually
> mergeable (i.e. two physically discontiguous pages may be treated as a
> single SG entity for DMA because the IOMMU will patch up the
> discontinuity) if an IOMMU is present in the system. This effectively
> stymies using bypass mode, because segments may not be virtually merged
> in a bypass operation.
I assume only 2.5 has this problem, not 2.4, right?
>
> The Solution:
>
> Is to teach the block layer not to virtually merge segments if either
> segment may be bypassed. To that end, the block layer has to know what
> the physical dma mask is (not the bounce limit, which is different) and
> it must also know the address bits that must be asserted in bypass
> mode. To that end, I've introduced a new #define for asm/io.h
>
> BIO_VMERGE_BYPASS_MASK
But a mask is not good for AMD64 because there is no guarantee
that the bypass/iommu address is checkable using a mask
(K8 uses a memory hole for IOMMU purposes and for various
reasons the hole can be anywhere in the address space).
This means x86_64 needs a function. Also the name is quite weird and
the issue is not really BIO specific. How about just calling it
iommu_address() ?
-Andi
Re: [RFC] block layer support for DMA IOMMU bypass mode
On Tue, 2003-07-01 at 12:09, Andi Kleen wrote:
> I assume only 2.5 has this problem, not 2.4, right?
Yes, sorry, I'm so focussed on 2.5 I keep forgetting 2.4.
> But a mask is not good for AMD64 because there is no guarantee
> that the bypass/iommu address is checkable using a mask
> (K8 uses an memory hole for IOMMU purposes and for various
> reasons the hole can be anywhere in the address space)
>
> This means x86_64 needs an function. Also the name is quite weird and
> the issue is not really BIO specific. How about just calling it
> iommu_address() ?
The name was simply to be consistent with BIO_VMERGE_BOUNDARY which is
another asm/io.h setting for this.
Could you elaborate more on the amd64 IOMMU window. Is this a window
where IOMMU mapping always takes place?
I'm a bit reluctant to put a function like this in because the block
layer does a very good job of being separate from the dma layer.
Maintaining this separation is one of the reasons I added a dma_mask to
the request_queue, not a generic device pointer.
James
Re: [RFC] block layer support for DMA IOMMU bypass mode
On 01 Jul 2003 12:28:47 -0500
James Bottomley <James.Bottomley@steeleye.com> wrote:
> Could you elaborate more on the amd64 IOMMU window. Is this a window
> where IOMMU mapping always takes place?
Yes.
K8 doesn't have a real IOMMU. Instead it extended the AGP aperture to work
for PCI devices too. The AGP aperture is a hole in memory configured
at boot, normally mapped directly below 4GB, but it can be elsewhere
(it's actually a BIOS option on machines without an AGP chip and when
the BIOS option is off Linux allocates some memory and puts the hole
on top of it. This allocated hole can be anywhere in the first 4GB)
Inside the AGP aperture memory is always remapped, you get a bus abort
when you access an area in there that is not mapped.
In short to detect it it needs to test against an address range,
a mask is not enough.
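The distinction Andi draws can be made concrete: a mask test can only describe bits that must be set, while a K8-style aperture needs an explicit base/size range compare (the addresses below are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Mask-style test, as in the proposal: an address bypasses when the
 * required bypass bits are all set in it. */
static int bypassed_by_mask(uint64_t addr, uint64_t bypass_mask)
{
	return (addr & bypass_mask) == bypass_mask;
}

/* Range-style test, which a K8 aperture needs: the remapped hole can
 * sit anywhere in the first 4GB, so only a base/size compare works. */
static int inside_aperture(uint64_t addr, uint64_t base, uint64_t size)
{
	return addr >= base && addr < base + size;
}
```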
>
> I'm a bit reluctant to put a function like this in because the block
> layer does a very good job of being separate from the dma layer.
> Maintaining this separation is one of the reasons I added a dma_mask to
> the request_queue, not a generic device pointer.
Not sure I understand why you want to do this in the block layer.
It's a generic extension of the PCI DMA API. The block devices/layer itself
has no business knowing such intimate details about the pci dma
implementation, it should just ask.
-Andi
Re: [RFC] block layer support for DMA IOMMU bypass mode
Followup to: <1057080529.2003.62.camel@mulgrave>
By author: James Bottomley <James.Bottomley@steeleye.com>
In newsgroup: linux.dev.kernel
>
> The name was simply to be consistent with BIO_VMERGE_BOUNDARY which is
> another asm/io.h setting for this.
>
> Could you elaborate more on the amd64 IOMMU window. Is this a window
> where IOMMU mapping always takes place?
>
It's a window (in the form of a BAR - base and mask) within which
IOMMU mapping always takes place. Outside the window everything is
bypass.
This applies to all x86-64 machines and some i386 machines, in
particular those i386 chipsets with "full GART" support as opposed to
"AGP only GART" (my terminology.)
Andi likes to say this isn't a real IOMMU (mostly because it doesn't
solve the legacy region problem), but I disagree with that view. It
still would be nicer if it covered more address space, though.
I don't know if it would be worthwhile to support "full GART" on the
i386 systems which support it.
-hpa
--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64
Re: [RFC] block layer support for DMA IOMMU bypass mode
On Tue, Jul 01, 2003 at 11:46:12AM -0500, James Bottomley wrote:
...
> However, a fair number come with
> the caveat that actually using the IOMMU is expensive.
Clarification:
IOMMU mapping is slightly more expensive than direct physical on HP boxes.
(yes davem, you've told me how wonderful sparc IOMMU is ;^)
But it is obviously a lot less expensive than bounce buffers.
> The Problem:
>
> At the moment, the block layer assumes segments may be virtually
> mergeable (i.e. two physically discontiguous pages may be treated as a
> single SG entity for DMA because the IOMMU will patch up the
> discontinuity) if an IOMMU is present in the system.
The symptom is drivers which have limits on DMA entries will return
errors (or crash) when the IOMMU code doesn't actually merge as much
as the BIO code expected.
Specifically, sym53c8xx_2 only takes 96 DMA entries per IO and davidm
hit that pretty easily on ia64.
MPT/Fusion (LSI u32) doesn't seem to have a limit.
IDE limit is PAGE_SIZE/8 (or 16k/8=2k for ia64).
I haven't checked other drivers.
...
> +/* Is the queue dma_mask eligible to be bypassed */
> +#define __BIO_CAN_BYPASS(q) \
> + ((BIO_VMERGE_BYPASS_MASK) && ((q)->dma_mask & (BIO_VMERGE_BYPASS_MASK)) == (BIO_VMERGE_BYPASS_MASK))
Like Andi, I had suggested a callback into IOMMU code here.
But I'm pretty sure james proposal will work for ia64 and parisc.
Ideally, I don't like to see two separate chunks of code performing
the "let's see what I can merge now" loops. Instead, BIO could merge
"intra-page" segments and call the IOMMU code to "merge" remaining
"inter-page" segments. IOMMU code needs to know how many physical entries
are allowed (when to stop processing) and could return the number of
sg list entries it was able to merge.
thanks james!
grant
Re: [RFC] block layer support for DMA IOMMU bypass mode
On Tue, Jul 01, 2003 at 07:42:41PM +0200, Andi Kleen wrote:
> K8 doesn't have a real IOMMU. Instead it extended the AGP aperture to work
> for PCI devices too.
*gag*...sounds like exactly the opposite HP ZX1 workstations do.
They used part of the SBA IOMMU for AGP GART.
thanks,
grant
Re: [RFC] block layer support for DMA IOMMU bypass mode
On Tue, 2003-07-01 at 12:42, Andi Kleen wrote:
> K8 doesn't have a real IOMMU. Instead it extended the AGP aperture to work
> for PCI devices too. The AGP aperture is a hole in memory configured
> at boot, normally mapped directly below 4GB, but it can be elsewhere
> (it's actually a BIOS option on machines without an AGP chip and when
> the BIOS option is off Linux allocates some memory and puts the hole
> on top of it. This allocated hole can be anywhere in the first 4GB)
> Inside the AGP aperture memory is always remapped, you get a bus abort
> when you access an area in there that is not mapped.
>
> In short to detect it it needs to test against an address range,
> a mask is not enough.
It sounds like basically anything not physically in the window is
bypassable, so you just set BIO_VMERGE_BYPASS_MASK to 1. Thus, any
segment within the device's dma_mask gets bypassed, and anything that's
not has to be remapped within the window.
I don't see where you need to put extra information into the virtual
merging process.
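With the mask set to 1, the per-segment decision reduces to the patch's __BVEC_PHYS_DIRECT_OK test: a segment may go direct iff its physical address fits entirely within the device's dma_mask (sketch, with made-up addresses):

```c
#include <assert.h>
#include <stdint.h>

/* A segment is directly addressable (bypassable) iff its physical
 * address fits within the device's physical DMA mask. */
static int phys_direct_ok(uint64_t phys, uint64_t dma_mask)
{
	return (phys & dma_mask) == phys;
}
```

A page below 4GB passes a 32-bit mask; one above it fails and must be remapped through the window.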
> > I'm a bit reluctant to put a function like this in because the block
> > layer does a very good job of being separate from the dma layer.
> > Maintaining this separation is one of the reasons I added a dma_mask to
> > the request_queue, not a generic device pointer.
>
> Not sure I understand why you want to do this in the block layer.
> It's a generic extension of the PCI DMA API. The block devices/layer itself
> has no business knowing such intimate details about the pci dma
> implementation, it should just ask.
Virtual merging is already part of the block layer. It actually
interferes with the ability to bypass the IOMMU because you can't merge
virtually if you want to do a bypass.
James
Re: [RFC] block layer support for DMA IOMMU bypass mode
Grant Grundler wrote:
>
> But I'm pretty sure james proposal will work for ia64 and parisc.
>
The thing that's got me concerned about this is that it allows
for sg lists that contain both entries that the block layer
expects will be mapped into the iommu and ones that it expects
to bypass. I don't like the implications of parsing through
sg lists looking for bypass-able and non-bypass-able groupings.
This seems like a lot more overhead than we have now and the
complexity of merging partially bypass-able scatterlists seems
time consuming.
The current ia64 sba_iommu does a quick and dirty sg bypass
check. If the device can dma to any memory address, the entire
sg list is bypassed. If not, the entire list is coalesced and
mapped by the iommu. The idea being that true performance
devices will have 64bit dma masks and be able to quickly bypass.
Everything else will at least get the benefit of coalescing
entries to make more efficient dma. The coalescing is a bit
simpler since it's the entire list as well. With this proposal,
we'd have to add a lot of complexity to partially bypass sg
lists. I don't necessarily see that as a benefit. Thanks,
Alex
--
Alex Williamson HP Linux & Open Source Lab
Re: [RFC] block layer support for DMA IOMMU bypass mode
On Tue, 2003-07-01 at 14:19, Grant Grundler wrote:
> On Tue, Jul 01, 2003 at 11:46:12AM -0500, James Bottomley wrote:
> > +/* Is the queue dma_mask eligible to be bypassed */
> > +#define __BIO_CAN_BYPASS(q) \
> > + ((BIO_VMERGE_BYPASS_MASK) && ((q)->dma_mask & (BIO_VMERGE_BYPASS_MASK)) == (BIO_VMERGE_BYPASS_MASK))
>
> Like Andi, I had suggested a callback into IOMMU code here.
> But I'm pretty sure james proposal will work for ia64 and parisc.
OK, the core of my objection to this is that at the moment there's no
entangling of the bio layer and the DMA layer. The bio layer works with
a nice finite list of generic or per-queue constraints; it doesn't care
currently what the underlying device or IOMMU does. Putting such a
callback in would add this entanglement.
It could be that the bio people will be OK with this, and I'm just
worrying about nothing, but in that case, they need to say so...
James
Re: [RFC] block layer support for DMA IOMMU bypass mode
On Tue, 2003-07-01 at 14:59, Alex Williamson wrote:
> The thing that's got me concerned about this is that it allows
> for sg lists that contain both entries that the block layer
> expects will be mapped into the iommu and ones that it expects
> to bypass. I don't like the implications of parsing through
> sg lists looking for bypass-able and non-bypass-able groupings.
> This seems like a lot more overhead than we have now and the
> complexity of merging partially bypass-able scatterlists seems
> time consuming.
>
> The current ia64 sba_iommu does a quick and dirty sg bypass
> check. If the device can dma to any memory address, the entire
> sg list is bypassed. If not, the entire list is coalesced and
> mapped by the iommu. The idea being that true performance
> devices will have 64bit dma masks and be able to quickly bypass.
> Everything else will at least get the benefit of coalescing
> entries to make more efficient dma. The coalescing is a bit
> simpler since it's the entire list as well. With this proposal,
> we'd have to add a lot of complexity to partially bypass sg
> lists. I don't necessarily see that as a benefit. Thanks,
But if that's all you want, you simply set BIO_VMERGE_BYPASS_MASK to
an all-ones u64 bitmask. Then it will only turn off virtual merging
for devices that have a fully set dma_mask, and your simple test will
work.
James
Re: [RFC] block layer support for DMA IOMMU bypass mode
I personally don't care how this is done, as long as I can make
all the overhead from the checks go away on my platform by
defining the interface macro to do nothing :-)
Re: [RFC] block layer support for DMA IOMMU bypass mode
On Tue, Jul 01, 2003 at 03:03:45PM -0500, James Bottomley wrote:
> OK, the core of my objection to this is that at the moment there's no
> entangling of the bio layer and the DMA layer.
I agree this is a good thing.
> The bio layer works with
> a nice finite list of generic or per-queue constraints; it doesn't care
> currently what the underlying device or IOMMU does.
I don't agree. This whole discussion revolves around getting BIO code and
IOMMU code to agree on how block merging works for a given platform.
Using a callback into IOMMU code means the BIO code truly doesn't have to know.
The platform specific IOMMU could just tell BIO code what it wants to
know (how many SG entries would fit into a limited number of physical
mappings).
> Putting such a callback in would add this entanglement.
yes, sort of. But I think this entanglement is present even for machines
that don't have an IOMMU because of bounce buffers. But if ia64's swiotlb
would be made generic to cover buffer bouncing....
> It could be that the bio people will be OK with this, and I'm just
> worrying about nothing, but in that case, they need to say so...
Would that be Jens Axboe/Dave Miller/et al?
thanks,
grant
Re: [RFC] block layer support for DMA IOMMU bypass mode II
On 01 Jul 2003 11:46:12 -0500
James Bottomley <James.Bottomley@steeleye.com> wrote:
On further thought about the issue:
The K8 IOMMU cannot support this virtually contiguous thing. The reason
is that there is no guarantee that an entry in a sglist is a multiple
of page size. And the aperture can only map 4K sized chunks, like
a CPU MMU. So e.g. when you have an sglist with multiple 1K entries there is
no way to get them continuous in IOMMU space (short of copying)
This means I just need a flag to turn this assumption off in the block layer.
Currently it doesn't even guarantee that pci_map_sg is contiguous for page
sized chunks - pci_map_sg is essentially just a loop that calls pci_map_single,
and it is quite possible that all the entries end up spread over the IOMMU hole.
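The behaviour Andi describes (pci_map_sg as a bare loop over pci_map_single, with no merging) looks schematically like this; the identity-map stub stands in for the real IOMMU allocator and is purely illustrative:

```c
#include <assert.h>
#include <stdint.h>

struct scatterlist {
	uint64_t	phys;		/* physical address of the segment */
	unsigned int	length;
	uint64_t	dma_addr;	/* filled in by the mapping */
};

/* Stand-in for the real per-entry IOMMU allocator: identity map. */
static uint64_t map_single_stub(uint64_t phys, unsigned int len)
{
	(void)len;
	return phys;
}

/* Schematic pci_map_sg: one mapping per entry, no attempt to make
 * consecutive entries contiguous in bus space. */
static int naive_map_sg(struct scatterlist *sg, int nents)
{
	int i;

	for (i = 0; i < nents; i++)
		sg[i].dma_addr = map_single_stub(sg[i].phys, sg[i].length);
	return nents;	/* entries out == entries in: nothing merged */
}
```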
Also James do you remember when these changes were added to the block layer?
We have a weird IDE corruption here and I'm wondering if it is related
to this.
-Andi
Re: [RFC] block layer support for DMA IOMMU bypass mode II
From: Andi Kleen <ak@suse.de>
Date: Wed, 2 Jul 2003 01:57:01 +0200

The K8 IOMMU cannot support this virtually contiguous thing. The reason
is that there is no guarantee that an entry in a sglist is a multiple
of page size. And the aperture can only map 4K sized chunks, like
a CPU MMU. So e.g. when you have an sglist with multiple 1K entries there is
What do you mean? You map only one 4K chunk, and this is used
for all the sub-1K mappings.
I can only map 8K sized chunks on the sparc64 IOMMU and this
works perfectly fine.
Re: [RFC] block layer support for DMA IOMMU bypass mode II
From: Andi Kleen <ak@suse.de>
Date: Wed, 2 Jul 2003 02:22:44 +0200
On Tue, 01 Jul 2003 17:03:23 -0700 (PDT)
"David S. Miller" <davem@redhat.com> wrote:

> What do you mean? You map only one 4K chunk, and this is used
> for all the sub-1K mappings.

How should this work when the 1K mappings are spread all over memory?

Maybe I'm missing something but from James description it sounds like the
block layer assumes that it can pass in an sglist with arbitrary elements
and get it back remapped to contiguous DMA addresses.

It assumes it can pass in an sglist with arbitrary "virtually
contiguous" elements and get back a contiguous DMA address.
The BIO_VMERGE_BOUNDARY defines the IOMMU page size and therefore
what "virtually contiguous" means.
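Concretely, the existing virtual-merge test is: two segments can share one IOMMU mapping only if the first ends, and the second starts, on an IOMMU page boundary (sketch assuming a 4K boundary; the constant name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define VMERGE_BOUNDARY 4096ULL		/* the IOMMU page size */

/* end1: physical address one past the end of the first segment;
 * start2: physical address of the start of the second segment.
 * Mergeable only when both sit on an IOMMU page boundary, so the
 * IOMMU can map the two physical ranges back-to-back in bus space. */
static int virt_mergeable(uint64_t end1, uint64_t start2)
{
	return ((end1 | start2) & (VMERGE_BOUNDARY - 1)) == 0;
}
```

This is why Andi's sub-page (e.g. 1K) entries cannot merge: their ends and starts do not fall on the boundary.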
Re: [RFC] block layer support for DMA IOMMU bypass mode II
On Tue, 01 Jul 2003 17:03:23 -0700 (PDT)
"David S. Miller" <davem@redhat.com> wrote:
> From: Andi Kleen <ak@suse.de>
> Date: Wed, 2 Jul 2003 01:57:01 +0200
>
> The K8 IOMMU cannot support this virtually contiguous thing. The reason
> is that there is no guarantee that an entry in a sglist is a multiple
> of page size. And the aperture can only map 4K sized chunks, like
> a CPU MMU. So e.g. when you have an sglist with multiple 1K entries there is
> What do you mean? You map only one 4K chunk, and this is used
> for all the sub-1K mappings.
How should this work when the 1K mappings are spread all over memory?
Maybe I'm missing something but from James description it sounds like the
block layer assumes that it can pass in an sglist with arbitrary elements
and get it back remapped to contiguous DMA addresses.
-Andi
Re: [RFC] block layer support for DMA IOMMU bypass mode
On Tue, 2003-07-01 at 18:01, Grant Grundler wrote:
> > The bio layer works with
> > a nice finite list of generic or per-queue constraints; it doesn't care
> > currently what the underlying device or IOMMU does.
>
> I don't agree. This whole discussion revolves around getting BIO code and
> IOMMU code to agree on how block merging works for a given platform.
> Using a callback into IOMMU code means the BIO truly doesn't have to know.
> The platform specific IOMMU could just tell BIO code what it wants to
> know (how many SG entries would fit into a limited number of physical
> mappings).
Ah, but the point is that currently the only inputs the IOMMU has to the
bio layer are parameters. I'd like to keep it this way unless there's a
really, really good reason not to. At the moment it seems that the
proposed parameter covers all of IA64's needs and may cover AMD64's as
well.
> > Putting such a callback in would add this entanglement.
>
yes, sort of. But I think this entanglement is present even for machines
that don't have an IOMMU, because of bounce buffers. But if ia64's swiotlb
were made generic to cover buffer bouncing....
Well, not to get into the "where should ZONE_NORMAL end" argument again,
but I was hoping that GFP_DMA32 would eliminate the IA64 platform's need
for this. __blk_queue_bounce() strikes me as being much more heavily
exercised than the swiotlb, so I think it should be the one to remain.
It also has more context information to fail gracefully.
James
Re: [RFC] block layer support for DMA IOMMU bypass mode II
On Wed, Jul 02, 2003 at 02:22:44AM +0200, Andi Kleen wrote:
> > What do you mean? You map only one 4K chunk, and this is used
> > for all the sub-1K mappings.
>
> How should this work when the 1K mappings are spread all over memory?
It couldn't merge in this case.
> Maybe I'm missing something, but from James' description it sounds like the
> block layer assumes that it can pass in a sglist with arbitrary elements
> and get it back remapped to contiguous DMA addresses.
In the x86-64 case, if the 1k elements are not physically contiguous,
I think most of them would get their own mapping.
For x86-64, if an entry ends on a 4k alignment and the next one starts
on a 4k alignment, could those be merged into one DMA segment that uses
two adjacent mapping entries?
Anyway, using a 4k FS block size (e.g. ext2) would be more efficient
by allowing a 1:1 mapping of SG elements to DMA mappings.
grant
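Grant's question can be illustrated with a toy model of an aperture allocator. Everything here is invented for illustration; it is not the x86-64 GART code. If the first buffer ends on a 4K boundary and the second starts on one, mapping them through consecutive aperture slots yields bus addresses that are contiguous, i.e. one DMA segment built from two adjacent mapping entries.

```c
#include <assert.h>

/* Toy aperture model: aper_base, the slot allocator, and iommu_map
 * are all hypothetical names for this sketch. */
#define IOPAGE 4096UL

static unsigned long next_slot;			/* next free aperture slot */
static const unsigned long aper_base = 0x80000000UL;

/* Map [phys, phys+len) into consecutive 4K aperture pages and return
 * the bus address of the first byte.  Each touched 4K page consumes
 * one aperture entry; the sub-page offset is preserved. */
static unsigned long iommu_map(unsigned long phys, unsigned long len)
{
	unsigned long offset = phys & (IOPAGE - 1);
	unsigned long pages = (offset + len + IOPAGE - 1) / IOPAGE;
	unsigned long bus = aper_base + next_slot * IOPAGE + offset;

	next_slot += pages;
	return bus;
}
```

Mapping a 1K buffer that ends on a 4K boundary and then a 1K buffer that starts on one produces bus addresses where the second begins exactly where the first ends, so the pair can be reported as a single DMA segment.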
Re: [RFC] block layer support for DMA IOMMU bypass mode II
On Wed, Jul 02, 2003 at 01:57:01AM +0200, Andi Kleen wrote:
> The K8 IOMMU cannot support this virtually contiguous thing. The reason
> is that there is no guarantee that an entry in a sglist is a multiple
> of page size. And the aperture can only map 4K sized chunks, like
> a CPU MMU. So e.g. when you have an sglist with multiple 1K entries there is
> no way to get them contiguous in IOMMU space (short of copying)
Can two adjacent IOMMU entries be used to map two 1K buffers?
Assume the 1st buffer ends on a 4k alignment and the next one
starts on a 4k alignment.
grant
Re: [RFC] block layer support for DMA IOMMU bypass mode II
On Wed, Jul 02, 2003 at 10:53:33AM -0600, Grant Grundler wrote:
>
> > Maybe I'm missing something, but from James' description it sounds like the
> > block layer assumes that it can pass in a sglist with arbitrary elements
> > and get it back remapped to contiguous DMA addresses.
>
> In the x86-64 case, if the 1k elements are not physically contiguous,
> I think most of them would get their own mapping.
Yes, but it won't be contiguous in bus space.
>
> For x86-64, if an entry ends on a 4k alignment and the next one starts
> on a 4k alignment, could those be merged into one DMA segment that uses
> two adjacent mapping entries?
Yes, it is now in the version I wrote last night, but not in the
previous code that's in the tree.
-Andi
Re: [RFC] block layer support for DMA IOMMU bypass mode II
On Wed, Jul 02, 2003 at 10:55:10AM -0600, Grant Grundler wrote:
> On Wed, Jul 02, 2003 at 01:57:01AM +0200, Andi Kleen wrote:
> > The K8 IOMMU cannot support this virtually contiguous thing. The reason
> > is that there is no guarantee that an entry in a sglist is a multiple
> > of page size. And the aperture can only map 4K sized chunks, like
> > a CPU MMU. So e.g. when you have an sglist with multiple 1K entries there is
> > no way to get them contiguous in IOMMU space (short of copying)
>
> Can two adjacent IOMMU entries be used to map two 1K buffers?
> Assume the 1st buffer ends on a 4k alignment and the next one
> starts on a 4k alignment.
Yes, it could. But is that situation likely/worth handling?
-Andi
Re: [RFC] block layer support for DMA IOMMU bypass mode II
On Wed, Jul 02, 2003 at 07:20:26PM +0200, Andi Kleen wrote:
...
> > Can two adjacent IOMMU entries be used to map two 1K buffers?
> > Assume the 1st buffer ends on a 4k alignment and the next one
> > starts on a 4k alignment.
>
> Yes, it could. But is that situation likely/worth handling?
Probably. It would reduce the number of mappings by 25% (3 instead of 4).
My assumption is that two adjacent IOMMU entries have contiguous bus addresses.
I was trying to figure out if x86-64 should be setting
BIO_VMERGE_BOUNDARY to 0 or 4k.
It sounds like x86-64 could support "#define BIO_VMERGE_BOUNDARY 4096"
if the IOMMU code will return one DMA address for two SG list entries
in the above example.
hth,
grant
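Grant's 25% arithmetic can be sketched as follows. The type and function names are hypothetical, and the merge rule is an assumption based on the BIO_VMERGE_BOUNDARY semantics described in this thread: a junction between two entries merges if they are physically contiguous, or if the boundary is nonzero and both sides of the junction fall on it. A four-entry sglist with one mergeable junction then needs three DMA segments instead of four.

```c
#include <assert.h>

/* Hypothetical sg entry for this sketch. */
struct sg {
	unsigned long phys;
	unsigned long len;
};

/* Count the DMA segments a sglist collapses into, given a virtual
 * merge boundary (0 disables virtual merging entirely). */
static unsigned int dma_segments(const struct sg *sg, int n,
				 unsigned long boundary)
{
	unsigned int segs = n ? 1 : 0;

	for (int i = 1; i < n; i++) {
		unsigned long end = sg[i - 1].phys + sg[i - 1].len;
		int merge = (end == sg[i].phys) ||
			    (boundary != 0 &&
			     end % boundary == 0 &&
			     sg[i].phys % boundary == 0);
		if (!merge)
			segs++;
	}
	return segs;
}
```

With a 4K boundary, a list where exactly one junction meets on a 4K boundary yields 3 segments; with the boundary set to 0 the same list stays at 4, which is the 3-instead-of-4 reduction mentioned above.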
Re: [RFC] block layer support for DMA IOMMU bypass mode II
On Wed, 2003-07-02 at 17:55, Grant Grundler wrote:
> On Wed, Jul 02, 2003 at 01:57:01AM +0200, Andi Kleen wrote:
> > The K8 IOMMU cannot support this virtually contiguous thing. The reason
> > is that there is no guarantee that an entry in a sglist is a multiple
> > of page size. And the aperture can only map 4K sized chunks, like
> > a CPU MMU. So e.g. when you have an sglist with multiple 1K entries there is
> > no way to get them contiguous in IOMMU space (short of copying)
>
> Can two adjacent IOMMU entries be used to map two 1K buffers?
> Assume the 1st buffer ends on a 4k alignment and the next one
> starts on a 4k alignment.
When I played with optimising merging on some 2.4 I2O and aacraid
controller stuff I found three things:
1. We allocate pages in reverse order, so most merges can't occur
2. If you use a 4K fs, as most people do now, the issue is irrelevant
3. Almost every 1K mergeable entry was part of the same 4K page anyway
Re: [RFC] block layer support for DMA IOMMU bypass mode II
> 1. We allocate pages in reverse order, so most merges can't occur
I added a printk and I got quite a lot of merges during bootup
with normal IDE (sometimes 12+ segments).
-Andi
