Hi ClamAV team and users,
This is a follow-up to my previous posts, which can be found here<https://lists.clamav.net/pipermail/clamav-users/2024-February/013744.html> and here<https://lists.clamav.net/pipermail/clamav-users/2024-February/013744.html>. I wanted to summarize them and make sure the problem we identified is clear.
My team and I have noticed that ClamAV can be very slow when scanning certain PDF files. When we investigated, we found a potential root cause in the ClamAV source code. The function at https://github.com/Cisco-Talos/clamav/blob/5f934c16b47591157a7082b71e751c45f095e2c8/libclamav/pdf.c#L1984 handles PDF document tags. It keeps a state so that tags which require parameters are handled properly. However, this state is not reset after the parameters are parsed, so the result of parsing depends on the order in which tags are listed in the dictionary.
For example, an object with the following header will scan quickly because the image subtype comes before all filters:
```
429 0 obj << /ColorSpace /DeviceRGB /Name /im56 /Height 2850 /Subtype /Image /Filter /FlateDecode /DecodeParms << /Columns 1776 /Colors 3 /Predictor 2 >> /Type /XObject /Width 1776 /Length 25686 /BitsPerComponent 8 /Interpolate true >> stream
```
However, an object with the following header will scan slowly because the image subtype comes after the filter, so the image is dumped even though it should not be:
```
454 0 obj<</Length 455 0 R/Filter/FlateDecode/DecodeParms<</Columns 1776/Predictor 2/Colors 3>>/Width 1776/Height 2850/BitsPerComponent 8/ColorSpace/DeviceRGB/Interpolate true/Type/XObject/Name/im56/Subtype/Image>>stream
```
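To make the ordering effect concrete, here is a minimal toy model in C. It is not ClamAV's code: the state names, flag names, and `toy_scan_names` are all hypothetical, and the rule that `Image` is only recognized in the initial state is our assumption about the behavior described above. It also simplifies by treating a filter as a single name rather than an array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parser states and flags; illustrative only, not ClamAV's. */
enum toy_state { TOY_STATE_NONE, TOY_STATE_FILTER };

#define TOY_FLAG_IMAGE      0x1u
#define TOY_FLAG_HASFILTERS 0x2u
#define TOY_FLAG_FLATE      0x4u

#define TOY_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Names from the two dictionaries above, in their original order. */
static const char *const toy_order_fast[] = {
    "ColorSpace", "DeviceRGB", "Name", "im56", "Height", "Subtype", "Image",
    "Filter", "FlateDecode", "DecodeParms", "Columns", "Colors", "Predictor",
    "Type", "XObject", "Width", "Length", "BitsPerComponent", "Interpolate"
};
static const char *const toy_order_slow[] = {
    "Length", "Filter", "FlateDecode", "DecodeParms", "Columns", "Predictor",
    "Colors", "Width", "Height", "BitsPerComponent", "ColorSpace", "DeviceRGB",
    "Interpolate", "Type", "XObject", "Name", "im56", "Subtype", "Image"
};

/* Walk the names of one dictionary in order and return the flags that were
 * set. If reset_after_params is nonzero, the state goes back to NONE once
 * the filter's value has been consumed (the behavior we would expect). */
unsigned toy_scan_names(const char *const *names, size_t count,
                        int reset_after_params)
{
    enum toy_state state = TOY_STATE_NONE;
    unsigned flags = 0;

    assert(names != NULL);
    for (size_t i = 0; i < count; i++) {
        const char *n = names[i];
        if (strcmp(n, "Filter") == 0) {
            /* Accepted in any state; switches to the filter state. */
            flags |= TOY_FLAG_HASFILTERS;
            state = TOY_STATE_FILTER;
        } else if (strcmp(n, "FlateDecode") == 0) {
            /* Only meaningful as the value of a preceding Filter. */
            if (state == TOY_STATE_FILTER) {
                flags |= TOY_FLAG_FLATE;
                if (reset_after_params)
                    state = TOY_STATE_NONE;
            }
        } else if (strcmp(n, "Image") == 0) {
            /* Assumption: recognized only in the initial state. */
            if (state == TOY_STATE_NONE)
                flags |= TOY_FLAG_IMAGE;
        }
        /* Any other name is ignored and leaves the state untouched. */
    }
    return flags;
}
```

Without the reset, `toy_scan_names(toy_order_slow, TOY_COUNT(toy_order_slow), 0)` never sets `TOY_FLAG_IMAGE`, while the same call on `toy_order_fast` does. Passing `1` for `reset_after_params` makes both orders set it.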
Finally, at https://github.com/Cisco-Talos/clamav/blob/5f934c16b47591157a7082b71e751c45f095e2c8/libclamav/pdf.c#L1580, we see references to parameters, but they are used after the tags have been parsed. Neither `DP` nor `DecodeParms` is in `pdfname_actions`, so they do not affect the state.
Slow PDF scanning has been a known problem for 3 years, and it would be nice to see it addressed in a new patch soon.
Again, I'm happy to provide more details if needed. Thank you for your time.
Best,
Eric
________________________________
CONFIDENTIALITY NOTICE: This e-mail and any files attached may contain confidential information of Five9 and/or its affiliated entities. Access by the intended recipient only is authorized. Any liability arising from any party acting, or refraining from acting, on any information contained in this e-mail is hereby excluded. If you are not the intended recipient, please notify the sender immediately, destroy the original transmission and its attachments and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Copyright in this e-mail and any attachments belongs to Five9 and/or its affiliated entities.