Mailing List Archive

[clamav-users] Slow PDF Scanning pt 3.
Hi ClamAV team and users,

This is a follow-up to my previous posts, which can be found here<https://lists.clamav.net/pipermail/clamav-users/2024-February/013744.html> and here<https://lists.clamav.net/pipermail/clamav-users/2024-February/013744.html>. I want to summarize the issue and make sure the problem we identified is clear.


My team and I have noticed that ClamAV can be very slow when scanning certain PDF files. When we investigated, we found a potential root cause in the ClamAV source code. In https://github.com/Cisco-Talos/clamav/blob/5f934c16b47591157a7082b71e751c45f095e2c8/libclamav/pdf.c#L1984, ClamAV handles PDF document tags. This function keeps a state so that it can properly handle tags that take parameters. However, the state is not reset after the parameters are parsed, so parsing is sensitive to the order in which tags are listed in the dictionary.



For example, a PDF object with the following dictionary scans quickly because the image subtype comes before all filters:



```
429 0 obj << /ColorSpace /DeviceRGB /Name /im56 /Height 2850 /Subtype /Image /Filter /FlateDecode /DecodeParms << /Columns 1776 /Colors 3 /Predictor 2 >> /Type /XObject /Width 1776 /Length 25686 /BitsPerComponent 8 /Interpolate true >> stream
```

However, a PDF object with the following dictionary scans slowly because the image subtype comes after the filter (the image is dumped, though it should not be):

```
454 0 obj<</Length 455 0 R/Filter/FlateDecode/DecodeParms<</Columns 1776/Predictor 2/Colors 3>>/Width 1776/Height 2850/BitsPerComponent 8/ColorSpace/DeviceRGB/Interpolate true/Type/XObject/Name/im56/Subtype/Image>>stream
```
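
The order sensitivity described above can be illustrated with a toy model. The following Python sketch is not ClamAV's code: the state names, the `ACTIONS` table, the flag names, and the `reset_on_unknown` option are all assumptions made for illustration, loosely modelled on the description of a per-name action table in which each name only fires from a given state. It assumes that `/Image` is only recognised from the initial state and that `/Filter` moves the parser into a filter state that is never left.

```python
import re

# Hypothetical states; the real parser may use different ones.
STATE_NONE, STATE_FILTER = "none", "filter"

# Hypothetical action table: name -> (flag to set, required state, next state).
ACTIONS = {
    "Filter": (None, STATE_NONE, STATE_FILTER),
    "FlateDecode": ("FILTER_FLATE", STATE_FILTER, STATE_FILTER),
    "Image": ("IMAGE", STATE_NONE, STATE_NONE),
}

def scan_names(dictionary, reset_on_unknown=False):
    """Walk the /Names in a dictionary and collect the flags that fire."""
    state = STATE_NONE
    flags = set()
    for name in re.findall(r"/([A-Za-z0-9]+)", dictionary):
        action = ACTIONS.get(name)
        if action is None:
            # A name with no action. Optionally treat it as the end of the
            # filter list and return to the initial state (illustrative only).
            if reset_on_unknown:
                state = STATE_NONE
            continue
        flag, from_state, to_state = action
        if state != from_state:
            continue  # action ignored because the parser is in the wrong state
        if flag is not None:
            flags.add(flag)
        state = to_state
    return flags

# The two dictionaries quoted in the post.
FAST = ("<< /ColorSpace /DeviceRGB /Name /im56 /Height 2850 /Subtype /Image "
        "/Filter /FlateDecode /DecodeParms << /Columns 1776 /Colors 3 "
        "/Predictor 2 >> /Type /XObject /Width 1776 /Length 25686 "
        "/BitsPerComponent 8 /Interpolate true >>")
SLOW = ("<</Length 455 0 R/Filter/FlateDecode/DecodeParms<</Columns 1776"
        "/Predictor 2/Colors 3>>/Width 1776/Height 2850/BitsPerComponent 8"
        "/ColorSpace/DeviceRGB/Interpolate true/Type/XObject/Name/im56"
        "/Subtype/Image>>")

print(sorted(scan_names(FAST)))                         # image seen before filter
print(sorted(scan_names(SLOW)))                         # image seen after filter
print(sorted(scan_names(SLOW, reset_on_unknown=True)))  # with the illustrative reset
```

In this model the first dictionary yields both the image and filter flags, while the second yields only the filter flag because `/Image` arrives while the parser is still in the filter state. Enabling the hypothetical reset makes both orderings produce the same flags, which is the behaviour one would expect if the state were cleared once the filter parameters had been consumed.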


Finally, the code at https://github.com/Cisco-Talos/clamav/blob/5f934c16b47591157a7082b71e751c45f095e2c8/libclamav/pdf.c#L1580 refers to the parameters, but only after the tags have been parsed. Neither DP nor DecodeParms is in `pdfname_actions`, so neither affects the state.



Slow PDF scanning has been a known problem for 3 years, and it would be nice to see it addressed in a new patch soon.



Again, I'm happy to provide more details if needed. Thank you for your time.



Best,

Eric



________________________________

CONFIDENTIALITY NOTICE: This e-mail and any files attached may contain confidential information of Five9 and/or its affiliated entities. Access by the intended recipient only is authorized. Any liability arising from any party acting, or refraining from acting, on any information contained in this e-mail is hereby excluded. If you are not the intended recipient, please notify the sender immediately, destroy the original transmission and its attachments and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Copyright in this e-mail and any attachments belongs to Five9 and/or its affiliated entities.
Re: [clamav-users] Slow PDF Scanning pt 3.
Hi Eric,

Thank you for the in-depth analysis of the PDF scanning speed issue.

We took a look at the bytecode (BC) signatures and, weighing their performance impact against the value of their detections, decided to drop them. You should have seen them removed in yesterday's update to the bytecode.cvd database. I'm hopeful that this mostly resolves the concern about slow PDF scans.

With regard to your analysis of the PDF object dictionary parsing, I could use your help. You mention that the state is not reset when looking for object dictionary keys, which causes the ordering to matter. You implied that this causes ClamAV's PDF parser to fail to extract (dump) some images. We should fix it so that it correctly extracts every image, as image detection is very useful in identifying phishing documents and other malicious documents and emails.

If you have any specific recommendations for fixing this issue, we would appreciate it.

Also, if you have sample files that I could debug which illustrate the image extraction issue you described, I would appreciate a copy.

On a side note, we will be investigating the use of pdfium or another third-party PDF parser in the future to improve detection and performance. Depending on the results of this investigation, we may replace our own PDF parser partially or entirely. I mention this so that you do not spend a tremendous amount of effort on this issue.

Regards,
Micah


Micah Snyder (they/them)
ClamAV Development
Talos
Cisco Systems, Inc.
________________________________
From: clamav-users <clamav-users-bounces@lists.clamav.net> on behalf of Eric Zhou via clamav-users <clamav-users@lists.clamav.net>
Sent: Thursday, February 22, 2024 2:29 PM
To: clamav-users@lists.clamav.net <clamav-users@lists.clamav.net>
Cc: Eric Zhou <Eric.Zhou@five9.com>
Subject: [clamav-users] Slow PDF Scanning pt 3.


Re: [clamav-users] Slow PDF Scanning pt 3.
Hi all,

> You implied that this causes ClamAV's PDF parser to fail to extract
> (dump) some images.  We should fix it so that it will correctly extract
> every image, as image detection is very useful in identifying phishing
> documents and other malicious documents and emails.

Good news! I'm looking forward to that!


--
Cordialement / Best regards,

Arnaud Jacques
Manager of SecuriteInfo.com

Phone: +33-(0)3.60.47.09.81
E-mail: aj@securiteinfo.com
Website: https://www.securiteinfo.com
Facebook : https://www.facebook.com/pages/SecuriteInfocom/132872523492286
Twitter : @SecuriteInfoCom
Writing signatures for ClamAV antivirus since 2006
_______________________________________________

Manage your clamav-users mailing list subscription / unsubscribe:
https://lists.clamav.net/mailman/listinfo/clamav-users


Help us build a comprehensive ClamAV guide:
https://github.com/Cisco-Talos/clamav-documentation

https://docs.clamav.net/#mailing-lists-and-chat