Mailing List Archive: [RFC] killing the NR

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

Feb 17, 2007, 8:58 PM

Post #26 of 40 (1229 views)

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>> The only time it really makes sense to me to let the irq number vary
>> arbitrary are when things are truly dynamic, like with MSI, a
>> hypervisor, or hot plug interrupt controllers.
>
> I don't understand why you would go to all that lenght to replace irq
> numbers with irq_desc * and ... keep then numbers :-)

Because I don't have something better to replace them with.

We need names for irqs, currently the kernel/user space interface
is a unsigned number.

Printing out a pointer where we currently have an integer in:
/proc/interrupts
/proc/irq/N/...
/sys/devices/pci0000:00/0000:00:0e.0/irq
is a bad practice, and if I don't retain the number that is
my only choice.

I similar problem exists in all of the initialization messages from
device drivers that display their irq number.

Plus I think there are also a few ioctls that return the linux
irq number.

Now it may make sense to replace my irq_nr() with irq_name(), and
return a string that can be used instead, but fixing the kernel
user space interface is a third step that is a lot more delicate
and will require more thinking. So I would prefer to put that
off until all of the internal users are using a pointer. Then
we can grep for irq_nr and see how many places we actually export
the irq number to user space.

The fact that the user space has been put in charge of when to
migrate an irq from cpu to another makes this double delicate.

>> Sure, and I have the same issue with a big "DESIGNED FOR ppc" in the middle,
>> or "DESIGNED FOR arch/x". However the unfortunate truth is that the x86
>> has enough volume that frequently other architectures use some x86
>> hardware and thus get some of x86's warts. So anything that doesn't
>> cope with the x86's warts is frequently doomed to failure.
>
> I fait to see how what I described would not apply nicely to x86 ..

The model can be made to work if you force it but it isn't really
a good fit.

I can't really use the (cpu#, vector#) tuple as hw number as it
varies at runtime, and a single interrupt can send different (cpu#,
vector#) tuples from one interrupt message to the next without being
reprogrammed. At least I don't have the impression that you support
multiple hardware numbers going to the same linux irq. But this
really is the layer where I need the reverse mapping. However I can
optimize the reverse mapping by taking advantage of the per cpu
nature.

Currently the hardware number that I use is the pin number on
the ioapic. And to form the linux irq I just add the number
of pins of all previous ioapics together and then add my pin number.
Fairly simple.

Doing the above gives me stable names that are the same from one boot
to the next if someone doesn't change how the hardware is put
together. It looks to me that if I adapt the ppc scheme my irq
numbers will change from one boot to the next one kernel to the next,
almost at random. Depending on driver initialization order and
similar things. Having names that change all of the time is confusing
and not very useful. The fact that in the process of making my names
stable it actually happens to reflect part of the irq hardware
topology is incidental.

Giving up stable names is not something I want to do.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

ebiederm at xmission

Feb 17, 2007, 10:30 PM

Post #27 of 40 (1226 views)

Permalink

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> On Sat, 2007-02-17 at 02:06 -0700, Eric W. Biederman wrote:

> However, PowerPC is a good example because it has such a diversity of
> very different hardware setups to deal with, ranging from the multiple
> layers of cascading controllers all over the place, to interrupts
> packets encoding vector/target etc... a bit like x86 on cell, to
> hypervisors providing a single giant number space etc etc etc...
>
> Thus, it is extremely likely that something that works well for PowerPC
> (or for ARM for that matter as it's probably as a "colorful" environment
> as PowerPC is) will end up being useful for others.

Sure I agree. Part of what I'm trying to say is that it appears
that basic interrupt handling assumptions seem to be inherent to the
architectures. And as much as it surprises me because of basic
assumptions I don't think there is any architecture with every flavor
of color.

>> I have a version of the x86 code with a partial conversion done and
>> I didn't need a reverse mapping. What you call the hardware interrupt
>> number never happens to be interesting to me after the system is setup.
>
> Because you have the ability to tell your PIC to give you your "linux"
> interrupt number when actually sending the interrupt to the processor ?
> You need a way to get to the irq_desc * when getting an IRQ, either you
> have a way to map HW numbers back to irq_desc * in sofrware, or your HW
> allows you to do it.

I don't think is totally foreign, but in essence I have two kinds of
hardware number. An (apic, pin) pair that I need when talking to
the hardware itself and a (cpu, vector) pair that I use when handling
an interrupt. The vector number has never been the linux irq number
but at times it has only needed a simple offset adjustment. Now that
we are having to handle bigger cases only the (apic, pin) pairs that
are actually used get a (cpumask_t, vector) assigned to them.

It may be that the only difference from the cell is that I have a very
small vector number I have to cope with instead of being able to tell
the irq controller to give me something immediately useful.

> I'm saying that if we're going to change the IRQ stuff that deeply, it
> would be nice if we looked into some of that stuff I've done that I
> beleive would be of use for other archs.

Reasonable.

For the first pass when I do the genirq conversion passing struct
irq_desc *irq instead of unsigned int irq, I should be able to
do something stupid and correct on all of the architectures. When
the start taking advantage of the new freedom though generic helpers
can be good.

> I found it overall very useful to have a generic remapping core and have
> cascaded PIC setups have a numbering domain local to a given PIC (pretty
> much, a domain != an irq_chip) and I'm convinced it would make life
> easier for archs with similar setups. The remapping core also shows its
> usefulness on archs with very big interrupt numbers, like sparc or
> pSeries ppc, and possibly others.

Except for the what appears to be instability of the irq numbers on
simpler configurations I don't have a problem with it.

> Now, I -do- have a problem with one aspect of your proposed design which
> is to keep the "linux" interrupt number in the generic irq_desc, which I
> think defeats most of the purpose of moving away from those linux irq
> numbers. If you do so, then I'll have to keep a separate remapping layer
> and keep a mecanism for virtualizing linux numbers.

Until we find a solution for the user space side of things we seem to
need the unsigned int irq number for user space. Now I don't want
people mapping back and forth which is why I don't intend to provide a
reverse function.

But of course there will be a for_each_irq in the genirq layer so if
people really want to they will be able to go from the linux irq to
an irq_desc. But we don't have to export that generically (except
possibly something for the isa irqs).

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

benh at kernel

Feb 18, 2007, 11:58 AM

Post #28 of 40 (1210 views)

Permalink

> Because I don't have something better to replace them with.
>
> We need names for irqs, currently the kernel/user space interface
> is a unsigned number.

.../...

Ok, as long as it's strictly a userspace issue, I understand.

> The model can be made to work if you force it but it isn't really
> a good fit.
>
> I can't really use the (cpu#, vector#) tuple as hw number as it
> varies at runtime, and a single interrupt can send different (cpu#,
> vector#) tuples from one interrupt message to the next without being
> reprogrammed. At least I don't have the impression that you support
> multiple hardware numbers going to the same linux irq. But this
> really is the layer where I need the reverse mapping. However I can
> optimize the reverse mapping by taking advantage of the per cpu
> nature.
>
> Currently the hardware number that I use is the pin number on
> the ioapic. And to form the linux irq I just add the number
> of pins of all previous ioapics together and then add my pin number.
> Fairly simple.

Ok.

> Doing the above gives me stable names that are the same from one boot
> to the next if someone doesn't change how the hardware is put
> together. It looks to me that if I adapt the ppc scheme my irq
> numbers will change from one boot to the next one kernel to the next,
> almost at random.

Not necessarily. The current code would though in practice, it doesn't
change much, but as I said, it would be trivial to change it so that a
domain using a linear reverse map is fully "allocated" when initialized.

Stable numbers are somewhat useful for users, thus one of the thing the
ppc code tries to do in order to get "mostly" stable numbers/names as
well is to try to use the HW number as a "starting" value when searching
for a linux irq number to assign. Only if it's not possible (the HW
number is in the reserved ISA range, too big, or the matching linux
number is already used) then it will allocate a new one.

I still think I need to add the domain and HW number to /proc/interrupts
though (or to some other sysfs file) in order to provide the mappign to
virqs to userland.

Also, if the eixsting linear and radix tree reverse maps don't fit your
needs, you can create a new one or request no reverse map from the core
at all and do everything in your arch code.

Basically, what I'm saying is that I'd like the concept of domain &
generic remappers to become generic. sparc64 and ppc use it and I'm
convinced it would be useful to other especially archs with lots of
cascaded PICs like ARM. And by doing so, we can also make it easier to
expose the HW -> virq mapping to userland in a common place with a
consistent format since it would be done by generic code.

Cheers,
Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

benh at kernel

Feb 18, 2007, 12:01 PM

Post #29 of 40 (1225 views)

Permalink

> Except for the what appears to be instability of the irq numbers on
> simpler configurations I don't have a problem with it.

I agree that's a bit annoying and I beleive it can be fixed. In
additionm I'd like to look into exposing the domain/HW number -> virq
mapping somewhere in sysfs maybe.

> Until we find a solution for the user space side of things we seem to
> need the unsigned int irq number for user space. Now I don't want
> people mapping back and forth which is why I don't intend to provide a
> reverse function.

Ok.

> But of course there will be a for_each_irq in the genirq layer so if
> people really want to they will be able to go from the linux irq to
> an irq_desc. But we don't have to export that generically (except
> possibly something for the isa irqs).

Ok.

Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

arjan at infradead

Feb 18, 2007, 1:24 PM

Post #30 of 40 (1222 views)

Permalink

On Fri, 2007-02-16 at 13:41 +0100, Ingo Molnar wrote:
> * Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> > So I propose we remove all assumptions from the code that we actually
> > have an array of irqs. That will allow for irq_desc to be dynamically
> > allocated instead of statically allocated saving memory and reducing
> > kernel complexity.
>
> hm. I'd suggest to do this without changing request_irq() - and then we
> could avoid the 'massive, every driver affected' change, right?

if request_irq() changes we might as well make a variant that takes a
PCI device struct rather than a number, for the 99% of PCI drivers that
use that.. (and then msi and other stuff becomes simpler :)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

benh at kernel

Feb 18, 2007, 4:25 PM

Post #31 of 40 (1227 views)

Permalink

On Sun, 2007-02-18 at 22:24 +0100, Arjan van de Ven wrote:
> On Fri, 2007-02-16 at 13:41 +0100, Ingo Molnar wrote:
> > * Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > > So I propose we remove all assumptions from the code that we actually
> > > have an array of irqs. That will allow for irq_desc to be dynamically
> > > allocated instead of statically allocated saving memory and reducing
> > > kernel complexity.
> >
> > hm. I'd suggest to do this without changing request_irq() - and then we
> > could avoid the 'massive, every driver affected' change, right?
>
> if request_irq() changes we might as well make a variant that takes a
> PCI device struct rather than a number, for the 99% of PCI drivers that
> use that.. (and then msi and other stuff becomes simpler :)

As a matter of fact, if IRQs has to be handled properly as resources of
their respective devices, I think request_irq replacement should take a
struct device...

In fact, having IRQs hanging off their respective devices would give a
proper way to access them via sysfs and implement the affinity etc...
thus providing a long term replacement for the current number based
APIs.

In addition, to facilitate the job of things like IRQ balancing daemons,
a /sys/irqs/ could be created containing symlinks to all irqs in the
system.

Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

ebiederm at xmission

Feb 27, 2007, 12:29 PM

Post #32 of 40 (1214 views)

Permalink

A quick update. I did some work on this and have some observations.

- Every back end irq implementation seems to have a different name for
the structure that describes irqs. So picking struct irq which is
different from everything seems to make sense. At the very least
a empty struct irq can be embedded in the architecture specific
irq description structure and we can use container_of to get back
to it.

- The name struct irq conflicts with a spurious definition in
include/pcmcia/cs.h, but otherwise is fine. So it probably makes
sense to use.

- Updating all of the drivers is going to be a pain for precisely the
reason we need to update the drivers. irq numbers are not handled
at all cleanly. In attempting just a few conversions I have seen
irq number stashed in everything for a u8 to a u64. I have seen
all values <= 0 thrown out as invalid.

So we have a decade or more of accumulated inconsistencies and it is
getting painful to work with. Changing the type to something that
won't support all of the old hacks should help cleanup the code.

- Because drivers are not consistent in their handling of irq numbers
whatever path I take to a conversion each patch needs to be thought
about instead of just performed. So I'm going to have to double
the API so a gradual conversion is possible.

- Converting the genirq code to use struct irq_desc pointers
throughout (instead of unsigned int irq) is straight forward and
mindless. Though it is tedious. Like I expected the drivers to be
:(

So it looks like all I need to do to convert the genirq backend is
to break the work up into small enough patches that each patch is
obviously correct. Then compiling on the different architectures
can just serve as a spot check, not as absolutely required step
during the code conversion.

- The converted genirq code was short and easier to follow. Mostly
because I got to kill all of the if (irq >= NR_IRQS) tests...
Null pointer dereferences are your friends!

- The are only 3 or 4 arrays of size NR_IRQS in non-architecture code
and I have patches in my queue to kill them, so that isn't too bad.

- All of the drivers that handle irqs need to be touched because one
of the parameters to the irq handler is the interrupt number. So
that needs to be converted.

So I think the path should be:

* Kill the arrays of size NR_IRQS in non-arch code.

* Add a variation of the API in interrupt.h that uses
"struct irq *irq" instead of "unsigned int irq"

Probably replacing request_irq with irq_request or something
trivial like that.

This will need to touch all of different irq implementation back
ends, but only very lightly.

* Convert the generic irq code to use struct irq * everywhere it
current uses "unsigned int irq".

* Start on the conversions of drivers and subsystems picking on
the easy ones first :)

* Adding for_each_irq_desc() and similar helpers to the generic
irq code.

* Add support in the generic irq code for architectures that don't
have a giant array of irqs.

* Convert x86_64 and i386 to dynamically allocate their irqs.
Routines using the old interfaces will be no longer O(1) more
likely O(N). So will be slow but request_irq and free_irq are
no where near the fast path so it doesn't matter. enable_irq
and disable_irq are the only cases that might matter and they
occur rarely enough fixing the drivers that matter should not
be a problem.

* Ultimately finish converting all of the drivers and remove the
compatibility cruft.

I will look at getting things started and some patches into -mm
sometime next month.

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

arnd at arndb

Feb 27, 2007, 4:41 PM

Post #33 of 40 (1221 views)

Permalink

On Tuesday 27 February 2007, Eric W. Biederman wrote:
> * Add a variation of the API in interrupt.h that uses
> "struct irq *irq" instead of "unsigned int irq"
>
> Probably replacing request_irq with irq_request or something
> trivial like that.
>
> This will need to touch all of different irq implementation back
> ends, but only very lightly.
>
> * Convert the generic irq code to use struct irq * everywhere it
> current uses "unsigned int irq".
>
> * Start on the conversions of drivers and subsystems picking on
> the easy ones first :)

Introducing the irq_request() etc. functions that take a struct irq*
instead of an int sounds good, but I'd hope we can avoid using those
in device drivers and do a separate abstraction for each bus_type
that deals with interrupts. I'm not sure if that's possible for
each bus_type, but the ones I have worked with in the past should
allow that:

pci: each device/function has a unique irq, drivers need not know
about it afaics.
isa/pnp: numbers from 1 to 15 are the right abstraction here, that
how isa has worked for ages.
s390: got rid of irq numbers already
ofw: an open firmware device can have a number of interrupts, but
like PCI, the driver only needs to know things like 'first
irq of this device', not how it's connected
ps3: irqs are requested from the firmware for each device, this
can happen under the covers.
mmc, usb, phy, ieee1394: these already have a higl-level abstraction
for interrupt events
platform: dunno, probably these really should use the struct irq
directly
eisa, mca, pcmcia, zorro, ...: no idea, but possibly similar to PCI.

Note that we can even start converting device drivers first, before
moving away from irq numbers. A typical PCI driver should get
somewhat simpler by the conversion, and when they are all converted,
we can replace pci_dev->irq with a struct irq* under the covers.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

ebiederm at xmission

Feb 27, 2007, 11:20 PM

Post #34 of 40 (1209 views)

Permalink

Arnd Bergmann <arnd@arndb.de> writes:

>
> Introducing the irq_request() etc. functions that take a struct irq*
> instead of an int sounds good, but I'd hope we can avoid using those
> in device drivers and do a separate abstraction for each bus_type
> that deals with interrupts. I'm not sure if that's possible for
> each bus_type, but the ones I have worked with in the past should
> allow that:
>
> pci: each device/function has a unique irq, drivers need not know
> about it afaics.
Then there is msi and with msi-x you can have up to 4K irqs.

> isa/pnp: numbers from 1 to 15 are the right abstraction here, that
> how isa has worked for ages.

There is some truth there yes. But for ISA the numbers are really
0 to 15.

> s390: got rid of irq numbers already

Yes. I should really look at that more and see if I could bring
s390 into the generic irq code with my planned changes.

> ofw: an open firmware device can have a number of interrupts, but
> like PCI, the driver only needs to know things like 'first
> irq of this device', not how it's connected
Yes.

> ps3: irqs are requested from the firmware for each device, this
> can happen under the covers.
> mmc, usb, phy, ieee1394: these already have a higl-level abstraction
> for interrupt events
> platform: dunno, probably these really should use the struct irq
> directly
> eisa, mca, pcmcia, zorro, ...: no idea, but possibly similar to PCI.

Largely for pci we already have dev->irq and the device just calls
request_irq to get things going. The challenge is that the token
in dev->irq get's looked at.

> Note that we can even start converting device drivers first, before
> moving away from irq numbers. A typical PCI driver should get
> somewhat simpler by the conversion, and when they are all converted,
> we can replace pci_dev->irq with a struct irq* under the covers.

Reasonable if it is easy and straight forward.
Something like pci_request_irq(dev,....) and the helper looks at
dev->irq under the covers and calls request_irq or whatever makes
sense. Is this what you are thinking. Examples would help me here.

What I really object to is not the irq numbers. As an arbitrary number
does not impose limits. What I object to is drivers that can't handle the
full range of numbers, and the limits imposed upon those numbers when
you require them to be indexes into an array.

For talking to user space I expect we will have numbers for a long time
to come yet.

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

benh at kernel

Feb 28, 2007, 12:09 AM

Post #35 of 40 (1222 views)

Permalink

> What I really object to is not the irq numbers. As an arbitrary number
> does not impose limits. What I object to is drivers that can't handle the
> full range of numbers, and the limits imposed upon those numbers when
> you require them to be indexes into an array.
>
> For talking to user space I expect we will have numbers for a long time
> to come yet.

I wouldn't bother too much about going into bus specific bits like
irq_request(dev, ...). Well, actually, I _do_ think it's a good thing to
pass the struct device to irq_request but that's a different issue
completely.

I think bus types should provide bus specific helpers to obtain the
struct irq *'s for a given device on that bus, but the API for
requesting/freeing them shall remain generic.

Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

arnd at arndb

Feb 28, 2007, 4:24 AM

Post #36 of 40 (1227 views)

Permalink

On Wednesday 28 February 2007, Eric W. Biederman wrote:
> Arnd Bergmann <arnd@arndb.de> writes:
>
> >
> > Introducing the irq_request() etc. functions that take a struct irq*
> > instead of an int sounds good, but I'd hope we can avoid using those
> > in device drivers and do a separate abstraction for each bus_type
> > that deals with interrupts. I'm not sure if that's possible for
> > each bus_type, but the ones I have worked with in the past should
> > allow that:
> >
> > pci: each device/function has a unique irq, drivers need not know
> > about it afaics.
> Then there is msi and with msi-x you can have up to 4K irqs.

I have to admit I still don't really understand how this works
at all. Can a driver that uses msi-x have different handlers
for each of those interrupts registered simultaneously?

I would expect that instead there should be only one 'struct irq'
for the device, with the handler getting a 12 bit number argument.

> > s390: got rid of irq numbers already
>
> Yes. I should really look at that more and see if I could bring
> s390 into the generic irq code with my planned changes.

I don't think there is much point in changing the s390 code, but
the way it is solved there may be interesting for other buses
as well. The interrupt handler there is not being registered
explicitly, but is part of the driver (in case of subchannel)
or of the device (in case of ccw_device) data structure.

Similarly, in a pci device, one could imagine that the
struct pci_driver contains a irq_handler_t member that
is registered from the pci_device_probe() function
if present.

> > Note that we can even start converting device drivers first, before
> > moving away from irq numbers. A typical PCI driver should get
> > somewhat simpler by the conversion, and when they are all converted,
> > we can replace pci_dev->irq with a struct irq* under the covers.
>
> Reasonable if it is easy and straight forward.
> Something like pci_request_irq(dev,....) and the helper looks at
> dev->irq under the covers and calls request_irq or whatever makes
> sense. Is this what you are thinking. Examples would help me here.

Ok, I had an example in on of my previous posts, but based on the
discussion since then, it has become significantly simpler, basically
reducing the work to

struct irq *pci_irq_request(struct pci_device *dev,
irq_handler_t handler)
{
if (!dev->irq)
return -ENODEV;

return irq_request(irq, handler, IRQF_SHARED,
&dev->driver->name, dev);
}
int pci_irq_free(struct pci_device *dev)
{
return irq_free(dev->irq, dev);
}

The most significant change of this to the current code
would be that we can pass arguments down to irq_request
automatically, e.g. the irq handler can always get the
pci_device as its dev_id.

> For talking to user space I expect we will have numbers for a long time
> to come yet.

I was wondering about that. Do you only mean /proc/interrupts or
are there other user interfaces we need to worry about?
For /proc/interrupts, what could break if we have interrupt numbers
only local to each controller and potentially duplicate numbers
in the list? It's good to be paranoid about changes to proc files,
but I can definitely see value in having meaningful interrupt
numbers in there instead of making up a more or less random mapping
to a flat number space.

Arnd <><
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

segher at kernel

Feb 28, 2007, 5:02 AM

Post #37 of 40 (1225 views)

Permalink

>>> pci: each device/function has a unique irq, drivers need not know
>>> about it afaics.
>> Then there is msi and with msi-x you can have up to 4K irqs.
>
> I have to admit I still don't really understand how this works
> at all. Can a driver that uses msi-x have different handlers
> for each of those interrupts registered simultaneously?

Yes. It doesn't have to, though.

> I would expect that instead there should be only one 'struct irq'
> for the device, with the handler getting a 12 bit number argument.

Why? The device really generates many different interrupts,
why hide this fact.

>> For talking to user space I expect we will have numbers for a long
>> time
>> to come yet.
>
> I was wondering about that. Do you only mean /proc/interrupts or
> are there other user interfaces we need to worry about?

There's the IRQ affinity stuff too.

> For /proc/interrupts, what could break if we have interrupt numbers
> only local to each controller and potentially duplicate numbers
> in the list? It's good to be paranoid about changes to proc files,
> but I can definitely see value in having meaningful interrupt
> numbers in there instead of making up a more or less random mapping
> to a flat number space.

Duplicate all this stuff into /sys in a sane format (*) and
wait until userland catches up, then throw away the /proc
interfaces. It'll take a while, and until that day you will
have to keep *some* interrupt number <-> interrupt bijection.
Userland tools that think they know what interrupt number
should be what are dead already.

(*) i.e., exposing the interrupt tree as a tree, cascaded
controllers and all.

Segher

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

ebiederm at xmission

Feb 28, 2007, 5:28 AM

Post #38 of 40 (1224 views)

Permalink

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>> What I really object to is not the irq numbers. As an arbitrary number
>> does not impose limits. What I object to is drivers that can't handle the
>> full range of numbers, and the limits imposed upon those numbers when
>> you require them to be indexes into an array.
>>
>> For talking to user space I expect we will have numbers for a long time
>> to come yet.
>
> I wouldn't bother too much about going into bus specific bits like
> irq_request(dev, ...). Well, actually, I _do_ think it's a good thing to
> pass the struct device to irq_request but that's a different issue
> completely.

Well irq_request is probably a bit late for associating an irq with
a struct device. And I don't see how to get the device name from that
but making that association wouldn't be a bad thing.

> I think bus types should provide bus specific helpers to obtain the
> struct irq *'s for a given device on that bus, but the API for
> requesting/freeing them shall remain generic.

Yes. But if you can do a good job of wrapping them so a driver
wouldn't need to care at all that could help simplify drivers and
remove one more tidbit of complexity from drivers.

However for a widespread change. The less you have to think about
it the more quickly you can get it completed. So I would rather
do several wide spread changes in succession, that I don't have to
think about much to do (i.e. the change mostly meets the obviously
correct requirement). Then to do one single change that I have to
think about harder to accomplish.

The more thinking comes into the picture the more you open yourself up
to breaking something by accident because it wasn't clear how the code
should be changed, and the more it slows down the conversion.

Conversions where we have had to think about things are notoriously slow
to complete. Even if they are a good thing to do. The examples I can
think of are the kthread API and the DMA api. Last I checked there
were still a few drivers in the kernel using virt_to_bus...

I don't really get the benefits I'm after unless the conversion can
actually be completed for everything. So the more I have to think about
any one piece the less I intend to do it.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

ebiederm at xmission

Feb 28, 2007, 5:53 AM

Post #39 of 40 (1212 views)

Permalink

Arnd Bergmann <arnd@arndb.de> writes:

>
> I have to admit I still don't really understand how this works
> at all. Can a driver that uses msi-x have different handlers
> for each of those interrupts registered simultaneously?

Yes, and the irqs can be routed at different cpus independently.
However not all hardware supports all 4K irqs. 4K is the implementation
defined maximu. Although infiniband HCA's are rumored to support a
lot of irqs and it isn't uncommon for simpler nics to support 4 or so.

Conceptually think of it as having an irq controller embedded in your
pci device.

The MSI messages are writes to special addresses that are then
converted into CPU interrupts.

> I would expect that instead there should be only one 'struct irq'
> for the device, with the handler getting a 12 bit number argument.

No. That would be unnecessary coalescing. Even if that was what the
hardware layer gave us the (and it doesn't) the generic layers should
demux these things as much as is reasonable.

>> > s390: got rid of irq numbers already
>>
>> Yes. I should really look at that more and see if I could bring
>> s390 into the generic irq code with my planned changes.
>
> I don't think there is much point in changing the s390 code, but
> the way it is solved there may be interesting for other buses
> as well. The interrupt handler there is not being registered
> explicitly, but is part of the driver (in case of subchannel)
> or of the device (in case of ccw_device) data structure.
>
> Similarly, in a pci device, one could imagine that the
> struct pci_driver contains a irq_handler_t member that
> is registered from the pci_device_probe() function
> if present.

Yes. There is some potential there. Although we would have to go
through an extra hoop to make it a pci specific handler type.

>> > Note that we can even start converting device drivers first, before
>> > moving away from irq numbers. A typical PCI driver should get
>> > somewhat simpler by the conversion, and when they are all converted,
>> > we can replace pci_dev->irq with a struct irq* under the covers.
>>
>> Reasonable if it is easy and straight forward.
>> Something like pci_request_irq(dev,....) and the helper looks at
>> dev->irq under the covers and calls request_irq or whatever makes
>> sense. Is this what you are thinking. Examples would help me here.
>
> Ok, I had an example in on of my previous posts, but based on the
> discussion since then, it has become significantly simpler, basically
> reducing the work to
>
> struct irq *pci_irq_request(struct pci_device *dev,
> irq_handler_t handler)
> {
> if (!dev->irq)
> return -ENODEV;
>
> return irq_request(irq, handler, IRQF_SHARED,
> &dev->driver->name, dev);
> }
> int pci_irq_free(struct pci_device *dev)
> {
> return irq_free(dev->irq, dev);
> }
>
> The most significant change of this to the current code
> would be that we can pass arguments down to irq_request
> automatically, e.g. the irq handler can always get the
> pci_device as its dev_id.

Yes. Mostly. Since dev_id is what is passed back to the irq handler,
it makes sense to pass the device when the irq is registered.
Passing the driver a pointer to the driver specific structure (not
the pci device) make a lot more of sense from an efficiency
standpoint. Now it may make sense to remove the irq parameter
from irq_handler_t, and require drivers to look at their dev_id to
see which irq they really are processing.

There is a danger here of making things so generic you don't have good
performance, or the code becomes unnecessarily complex.

>> For talking to user space I expect we will have numbers for a long time
>> to come yet.
>
> I was wondering about that. Do you only mean /proc/interrupts or
> are there other user interfaces we need to worry about?

Yes. There are other interfaces like /proc/irq/XXX/smp_affinity,
for irq migration. There are also device specific ioctls. There is
lspci. I don't know what all else, and given the current state of the
kernel it is hard to grep for.

> For /proc/interrupts, what could break if we have interrupt numbers
> only local to each controller and potentially duplicate numbers
> in the list? It's good to be paranoid about changes to proc files,
> but I can definitely see value in having meaningful interrupt
> numbers in there instead of making up a more or less random mapping
> to a flat number space.

Well I can have meaning full flat numbers and on i386 and x86_64
except for msi I have that. The problem is that for the numbers
to have meaning I get a very sparse usage of the numbers, because
very frequently the hardware interrupt controllers has pins that
are not connected up, so I have about an order of magnitude more
numbers then I have actually irqs in use. That is before I start
reserving irq numbers for MSI.

For MSI (since they cannot be shared) it would actually make a lot of
sense to make the numbers domain,bus,device,func,(Nth device irq) but
I can't because bus,device,func is 16bits worth of number. Add in
12 more bits for the worst case device assignment and I am up to
28 bits, and the rest of the 32bits can be used for domain or
something like that. So while the current unsigned int irq works fine
for that backing that with a fixed size array allocated at compile
time is just hopeless.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: [RFC] killing the NR_IRQS arrays. [ In reply to ]

benh at kernel

Mar 1, 2007, 2:47 AM

Post #40 of 40 (1199 views)

Permalink

> > Similarly, in a pci device, one could imagine that the
> > struct pci_driver contains a irq_handler_t member that
> > is registered from the pci_device_probe() function
> > if present.
>
> Yes. There is some potential there. Although we would have to go
> through an extra hoop to make it a pci specific handler type.

Beware with that approach though. If you are on a shared IRQ line, when
do you start getting called when an IRQ happen (possibly for the "other"
device) ? As soon as you are bound to the device ? But that means
potentially before the driver internal data structures are fully
initialized...

I like the driver having in control the "hooking" of the irq handler,
thus, when it starts being capable of handling interrupts (even if they
aren't initiated by that driver's device).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/