4<\/jats:sub> -enzymes from phylogenetic ancestors. These structures resolve yet unknown conformational intermediates and provide the first detailed view on the large conformational transitions of the protein in the catalytic cycle. Independently performed unrestrained MD simulations and configurational free energy calculations also identified these intermediates. In all, our experimental and computational data reveal strict coupling of the CD swiveling motion to the conformational state of the NBD. Moreover, structural asymmetries and nucleotide binding states in the PPDK dimer support an alternate binding change mechanism for this intriguing bioenergetic enzyme.<\/jats:p>","DOI":"10.1038\/srep45389","type":"journal-article","created":{"date-parts":[[2017,3,30]],"date-time":"2017-03-30T09:34:13Z","timestamp":1490866453000},"update-policy":"http:\/\/dx.doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":18,"title":["Structural intermediates and directionality of the swiveling motion of Pyruvate Phosphate Dikinase"],"prefix":"10.1038","volume":"7","author":[{"given":"Alexander","family":"Minges","sequence":"first","affiliation":[]},{"given":"Daniel","family":"Ciupka","sequence":"additional","affiliation":[]},{"given":"Christian","family":"Winkler","sequence":"additional","affiliation":[]},{"given":"Astrid","family":"H\u00f6ppner","sequence":"additional","affiliation":[]},{"given":"Holger","family":"Gohlke","sequence":"additional","affiliation":[]},{"given":"Georg","family":"Groth","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2017,3,30]]},"reference":[{"key":"BFsrep45389_CR1","doi-asserted-by":"publisher","first-page":"255","DOI":"10.1146\/annurev.pp.36.060185.001351","volume":"36","author":"GE Edwards","year":"1985","unstructured":"Edwards, G. E., Nakamoto, H., Burnell, J. N. & Hatch, M. D. Pyruvate, Pi Dikinase and NADP-Malate Dehydrogenase in C4 Photosynthesis: Properties and Mechanism of Light\/Dark Regulation. Ann. Rev. Plant Physio. 36, 255\u2013286 (1985).","journal-title":"Ann. Rev. Plant Physio"},{"key":"BFsrep45389_CR2","doi-asserted-by":"publisher","first-page":"3083","DOI":"10.1093\/jxb\/err058","volume":"62","author":"CJ Chastain","year":"2011","unstructured":"Chastain, C. J. et al. Functional evolution of C4 pyruvate, orthophosphate dikinase. J. Exp. Bot. 62, 3083\u20133091 (2011).","journal-title":"J. Exp. Bot."},{"key":"BFsrep45389_CR3","first-page":"607","volume":"6","author":"MD Hatch","year":"1979","unstructured":"Hatch, M. D. Regulation of C 4 Photosynthesis: Factors Affecting Cold-Mediated Inactivation and Reactivation of Pyruvate, P I Dikinase. Aust. J. Plant Physiol. 6, 607 (1979).","journal-title":"Aust. J. Plant Physiol."},{"key":"BFsrep45389_CR4","doi-asserted-by":"publisher","first-page":"826","DOI":"10.1104\/pp.62.5.826","volume":"62","author":"K Shirahashi","year":"1978","unstructured":"Shirahashi, K., Hayakawa, S. & Sugiyama, T. Cold Lability of Pyruvate, Orthophosphate Dikinase in the Maize Leaf. Plant Physiol. 62, 826\u2013830 (1978).","journal-title":"Plant Physiol."},{"key":"BFsrep45389_CR5","doi-asserted-by":"publisher","first-page":"2862","DOI":"10.1021\/bi00739a014","volume":"12","author":"T Sugiyama","year":"1973","unstructured":"Sugiyama, T. Purification, molecular, and catalytic properties of pyruvate phosphate dikinase from the maize leaf. Biochemistry-US. 
12, 2862\u20132868 (1973).","journal-title":"Biochemistry-US"},{"key":"BFsrep45389_CR6","doi-asserted-by":"publisher","first-page":"523","DOI":"10.1016\/S0981-9428(03)00065-2","volume":"41","author":"CJ Chastain","year":"2003","unstructured":"Chastain, C. J. & Chollet, R. Regulation of pyruvate, orthophosphate dikinase by ADP-\/Pi-dependent reversible phosphorylation in C3 and C4 plants. Plant Physiol. Bioch. 41, 523\u2013532 (2003).","journal-title":"Plant Physiol. Bioch"},{"key":"BFsrep45389_CR7","doi-asserted-by":"publisher","first-page":"924","DOI":"10.1007\/s00425-006-0259-3","volume":"224","author":"CJ Chastain","year":"2006","unstructured":"Chastain, C. J., Heck, J. W., Colquhoun, T. A., Voge, D. G. & Gu, X.-Y. Posttranslational regulation of pyruvate, orthophosphate dikinase in developing rice (Oryza sativa) seeds. Planta 224, 924\u2013934 (2006).","journal-title":"Planta"},{"key":"BFsrep45389_CR8","doi-asserted-by":"publisher","first-page":"2652","DOI":"10.1073\/pnas.93.7.2652","volume":"93","author":"O Herzberg","year":"1996","unstructured":"Herzberg, O. et al. Swiveling-domain mechanism for enzymatic phosphotransfer between remote reaction sites. P. Natl. Acad. Sci. USA. 93, 2652\u20132657 (1996).","journal-title":"P. Natl. Acad. Sci. USA"},{"key":"BFsrep45389_CR9","doi-asserted-by":"publisher","first-page":"14845","DOI":"10.1021\/bi701848w","volume":"46","author":"K Lim","year":"2007","unstructured":"Lim, K. et al. Swiveling Domain Mechanism in Pyruvate Phosphate Dikinase. Biochemistry-US. 46, 14845\u201314853 (2007).","journal-title":"Biochemistry-US"},{"key":"BFsrep45389_CR10","doi-asserted-by":"publisher","first-page":"780","DOI":"10.1021\/bi011799+","volume":"41","author":"O Herzberg","year":"2002","unstructured":"Herzberg, O. et al. Pyruvate site of pyruvate phosphate dikinase: crystal structure of the enzyme-phosphonopyruvate complex, and mutant analysis. Biochemistry-US. 41, 780\u2013787 (2002).","journal-title":"Biochemistry-US"},{"key":"BFsrep45389_CR11","doi-asserted-by":"publisher","first-page":"1136","DOI":"10.1021\/bi0484522","volume":"44","author":"T Nakanishi","year":"2005","unstructured":"Nakanishi, T., Nakatsu, T., Matsuoka, M., Sakata, K. & Kato, H. Crystal Structures of Pyruvate Phosphate Dikinase from Maize Revealed an Alternative Conformation in the Swiveling-Domain Motion. Biochemistry-US. 44, 1136\u20131144 (2005).","journal-title":"Biochemistry-US"},{"key":"BFsrep45389_CR12","doi-asserted-by":"publisher","first-page":"635","DOI":"10.1016\/S0092-8674(00)80525-5","volume":"90","author":"S Korolev","year":"1997","unstructured":"Korolev, S., Hsieh, J., Gauss, G. H., Lohman, T. M. & Waksman, G. Major Domain Swiveling Revealed by the Crystal Structures of Complexes of E. coli Rep Helicase Bound to Single-Stranded DNA and ADP. Cell 90, 635\u2013647 (1997).","journal-title":"Cell"},{"key":"BFsrep45389_CR13","doi-asserted-by":"publisher","first-page":"10586","DOI":"10.1038\/ncomms10586","volume":"7","author":"K Nguyen","year":"2016","unstructured":"Nguyen, K. & Whitford, P. C. Steric interactions lead to collective tilting motion in the ribosome during mRNA\u2013tRNA translocation. Nat Comms 7, 10586 (2016).","journal-title":"Nat Comms"},{"key":"BFsrep45389_CR14","doi-asserted-by":"publisher","first-page":"827","DOI":"10.1126\/science.1117230","volume":"310","author":"BS Schuwirth","year":"2005","unstructured":"Schuwirth, B. S. Structures of the Bacterial Ribosome at 3.5 A Resolution. 
Science 310, 827\u2013834 (2005).","journal-title":"Science"},{"key":"BFsrep45389_CR15","doi-asserted-by":"publisher","first-page":"677","DOI":"10.1038\/33612","volume":"392","author":"Z Zhang","year":"1998","unstructured":"Zhang, Z. et al. Electron transfer by domain movement in cytochrome bc1. Nature 392, 677\u2013684 (1998).","journal-title":"Nature"},{"key":"BFsrep45389_CR16","doi-asserted-by":"publisher","first-page":"3803","DOI":"10.1073\/pnas.1523614113","volume":"113","author":"X Qi","year":"2016","unstructured":"Qi, X. et al. Structural basis of rifampin inactivation by rifampin phosphotransferase. P. Natl. Acad. Sci. USA 113, 3803\u20133808 (2016).","journal-title":"P. Natl. Acad. Sci. USA"},{"key":"BFsrep45389_CR17","doi-asserted-by":"publisher","first-page":"519","DOI":"10.1073\/pnas.1518614113","volume":"113","author":"RH-J Wei\u00dfe","year":"2016","unstructured":"Wei\u00dfe, R. H.-J., Faust, A., Schmidt, M., Sch\u00f6nheit, P. & Scheidig, A. J. Structure of NDP-forming Acetyl-CoA synthetase ACD1 reveals a large rearrangement for phosphoryl transfer. P. Natl. Acad. Sci. USA. 113, 519\u2013528 (2016).","journal-title":"P. Natl. Acad. Sci. USA"},{"key":"BFsrep45389_CR18","doi-asserted-by":"publisher","first-page":"3115","DOI":"10.1021\/bi9621977","volume":"36","author":"I Wong","year":"1997","unstructured":"Wong, I. & Lohman, T. M. A two-site mechanism for ATP hydrolysis by the asymmetric rep dimer p2s as revealed by site-specific inhibition with ADP-AlF4. Biochemistry 36, 3115\u20133125 (1997).","journal-title":"Biochemistry"},{"key":"BFsrep45389_CR19","doi-asserted-by":"publisher","first-page":"1417","DOI":"10.1016\/S0022-2836(02)00113-4","volume":"318","author":"LW Cosenza","year":"2002","unstructured":"Cosenza, L. W., Bringaud, F., Baltz, T. & Vellieux, F. M. The 3.0\u00c5 Resolution Crystal Structure of Glycosomal Pyruvate Phosphate Dikinase from Trypanosoma brucei . J. Mol. Biol. 318, 1417\u20131432 (2002).","journal-title":"J. Mol. Biol."},{"key":"BFsrep45389_CR20","doi-asserted-by":"publisher","first-page":"16218","DOI":"10.1073\/pnas.0607587103","volume":"103","author":"A Teplyakov","year":"2006","unstructured":"Teplyakov, A. et al. Structure of phosphorylated enzyme I, the phosphoenolpyruvate:sugar phosphotransferase system sugar translocation signal protein. P. Natl. Acad. Sci. USA. 103, 16218\u201316223 (2006).","journal-title":"P. Natl. Acad. Sci. USA"},{"key":"BFsrep45389_CR21","doi-asserted-by":"publisher","first-page":"322","DOI":"10.1016\/S0076-6879(78)49017-2","volume":"49","author":"AS Mildvan","year":"1978","unstructured":"Mildvan, A. S. & Gupta, R. K. Nuclear relaxation measurements of the geometry of enzyme-bound substrates and analogs. Methods. Enzymol. 49, 322\u2013359 (1978).","journal-title":"Methods. Enzymol"},{"key":"BFsrep45389_CR22","doi-asserted-by":"publisher","first-page":"877","DOI":"10.1146\/annurev.bi.49.070180.004305","volume":"49","author":"JR Knowles","year":"1980","unstructured":"Knowles, J. R. Enzyme-Catalyzed Phosphoryl Transfer Reactions. Annu. Rev. Biochem. 49, 877\u2013919 (1980).","journal-title":"Annu. Rev. Biochem."},{"key":"BFsrep45389_CR23","doi-asserted-by":"publisher","first-page":"977","DOI":"10.1093\/emboj\/17.4.977","volume":"17","author":"L Esser","year":"1998","unstructured":"Esser, L. Synapsin I is structurally similar to ATP-utilizing enzymes. EMBO J. 
17, 977\u2013984 (1998).","journal-title":"EMBO J"},{"key":"BFsrep45389_CR24","doi-asserted-by":"publisher","first-page":"622","DOI":"10.1002\/prot.22910","volume":"79","author":"BR Novak","year":"2010","unstructured":"Novak, B. R., Moldovan, D., Waldrop, G. L. & de Queiroz, M. S. Behavior of the ATP grasp domain of biotin carboxylase monomers and dimers studied using molecular dynamics simulations. Proteins 79, 622\u2013632 (2010).","journal-title":"Proteins"},{"key":"BFsrep45389_CR25","doi-asserted-by":"publisher","first-page":"37630","DOI":"10.1074\/jbc.M105631200","volume":"276","author":"D Ye","year":"2001","unstructured":"Ye, D. et al. Investigation of the Catalytic Site within the ATP-Grasp Domain of Clostridium symbiosum Pyruvate Phosphate Dikinase. J. Biol. Chem. 276, 37630\u201337639 (2001).","journal-title":"J. Biol. Chem."},{"key":"BFsrep45389_CR26","doi-asserted-by":"publisher","first-page":"669","DOI":"10.1038\/35089509","volume":"2","author":"M Yoshida","year":"2001","unstructured":"Yoshida, M., Muneyuki, E. & Hisabori, T. ATP synthase\u2014a marvellous rotary engine of the cell. Nat. Rev. Mol. Cell Biol. 2, 669\u2013677 (2001).","journal-title":"Nat. Rev. Mol. Cell Biol."},{"key":"BFsrep45389_CR27","doi-asserted-by":"publisher","first-page":"898","DOI":"10.1038\/35073513","volume":"410","author":"R Yasuda","year":"2001","unstructured":"Yasuda, R., Noji, H., Yoshida, M., Kinosita, K. & Itoh, H. Resolution of distinct rotational substeps by submillisecond kinetic analysis of F1-ATPase. Nature 410, 898\u2013904 (2001).","journal-title":"Nature"},{"key":"BFsrep45389_CR28","doi-asserted-by":"publisher","first-page":"5433","DOI":"10.1038\/sj.emboj.7601410","volume":"25","author":"V Kabaleeswaran","year":"2006","unstructured":"Kabaleeswaran, V., Puri, N., Walker, J. E., Leslie, A. G. W. & Mueller, D. M. Novel features of the rotary catalytic mechanism revealed in the structure of yeast F1 ATPase. The EMBO Journal 25, 5433\u20135442 (2006).","journal-title":"The EMBO Journal"},{"key":"BFsrep45389_CR29","doi-asserted-by":"publisher","first-page":"309","DOI":"10.1016\/j.cell.2007.05.020","volume":"130","author":"K Adachi","year":"2007","unstructured":"Adachi, K. et al. Coupling of Rotation and Catalysis in F1-ATPase Revealed by Single-Molecule Imaging and Manipulation. Cell 130, 309\u2013321 (2007).","journal-title":"Cell"},{"key":"BFsrep45389_CR30","doi-asserted-by":"publisher","first-page":"2498","DOI":"10.1101\/gr.1239303","volume":"13","author":"P Shannon","year":"2003","unstructured":"Shannon, P. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. 13, 2498\u20132504 (2003).","journal-title":"Genome Res."},{"key":"BFsrep45389_CR31","doi-asserted-by":"publisher","first-page":"179","DOI":"10.1016\/j.tibs.2011.01.002","volume":"36","author":"NT Doncheva","year":"2011","unstructured":"Doncheva, N. T., Klein, K., Domingues, F. S. & Albrecht, M. Analyzing and visualizing residue networks of protein structures. Trends Biochem. Sci. 36, 179\u2013182 (2011).","journal-title":"Trends Biochem. Sci."},{"key":"BFsrep45389_CR32","first-page":"163","volume":"19","author":"F Glaser","year":"2003","unstructured":"Glaser, F. et al. ConSurf: Identification of Functional Regions in Proteins by Surface-Mapping of Phylogenetic Information. Method. Biochem. Anal. 19, 163\u2013164 (2003).","journal-title":"Method. Biochem. 
Anal"},{"key":"BFsrep45389_CR33","doi-asserted-by":"publisher","first-page":"1604","DOI":"10.1021\/ci100461k","volume":"51","author":"A Ahmed","year":"2011","unstructured":"Ahmed, A., Rippmann, F., Barnickel, G. & Gohlke, H. A Normal Mode-Based Geometric Simulation Approach for Exploring Biologically Relevant Conformational Transitions in Proteins. J. Chem. Inf. Model. 51, 1604\u20131622 (2011).","journal-title":"J. Chem. Inf. Model."},{"key":"BFsrep45389_CR34","doi-asserted-by":"crossref","unstructured":"Howard, J. Motor Proteins as Nanomachines: The Roles of Thermal Fluctuations in Generating Force and Motion. Biological Physics 47\u201359 (2010).","DOI":"10.1007\/978-3-0346-0428-4_3"},{"key":"BFsrep45389_CR35","doi-asserted-by":"crossref","unstructured":"Feynman, R., Leighton, R., Sands, M. & Hafner, E. The Feynman Lectures on Physics; Vol. I, vol. 33 (AAPT, 1965).","DOI":"10.1119\/1.1972241"},{"key":"BFsrep45389_CR36","doi-asserted-by":"publisher","first-page":"255","DOI":"10.1016\/S0096-4174(18)30128-8","volume":"7","author":"AF Huxley","year":"1957","unstructured":"Huxley, A. F. A hypothesis for the mechanism of contraction of muscle. Prog Biophys Biophys Chem 7, 255\u2013318 (1957).","journal-title":"Prog Biophys Biophys Chem"},{"key":"BFsrep45389_CR37","doi-asserted-by":"publisher","first-page":"315","DOI":"10.1007\/s003390201340","volume":"75","author":"H Wang","year":"2002","unstructured":"Wang, H. & Oster, G. Ratchets, power strokes, and molecular motors. Appl. Phys. A 75, 315\u2013323 (2002).","journal-title":"Appl. Phys. A"},{"key":"BFsrep45389_CR38","doi-asserted-by":"publisher","first-page":"55","DOI":"10.1016\/0079-6107(79)90025-7","volume":"33","author":"E Eisenberg","year":"1979","unstructured":"Eisenberg, E. & Hill, T. L. A cross-bridge model of muscle contraction. Prog. Biophys. Mol. Biol. 33, 55\u201382 (1979).","journal-title":"Prog. Biophys. Mol. Biol."},{"key":"BFsrep45389_CR39","doi-asserted-by":"publisher","first-page":"116","DOI":"10.1016\/0014-5793(90)81064-U","volume":"273","author":"E Rosche","year":"1990","unstructured":"Rosche, E. & Westhoff, P. Primary structure of pyruvate, orthophosphate dikinase in the dicotyledonous C 4 plant Flaveria trinervia. FEBS Lett. 273, 116\u2013121 (1990).","journal-title":"FEBS Lett"},{"key":"BFsrep45389_CR40","doi-asserted-by":"publisher","first-page":"763","DOI":"10.1007\/BF00013761","volume":"26","author":"E Rosche","year":"1994","unstructured":"Rosche, E., Streubel, M. & Westhoff, P. Primary structure of the photosynthetic pyruvate orthophosphate dikinase of the C3 plant Flaveria pringlei and expression analysis of pyruvate orthophosphate dikinase sequences in C3, C3\u2013C4 and C4 Flaveria species. Plant Mol. Biol. 26, 763\u2013769 (1994).","journal-title":"Plant Mol. Biol."},{"key":"BFsrep45389_CR41","doi-asserted-by":"crossref","first-page":"183","DOI":"10.1007\/BF00032598","volume":"24","author":"G Salahas","year":"1990","unstructured":"Salahas, G., Manetas, Y. & Gavalas, N. Assaying for pyruvate, orthophosphate dikinase activity: necessary precautions with phosphoenolpyruvate carboxylase as coupling enzyme. Photosynth. Res. 24, 183\u2013188 (1990).","journal-title":"Photosynth. Res."},{"key":"BFsrep45389_CR42","doi-asserted-by":"crossref","unstructured":"Kabsch, W. XDS. Acta Crystallogr. D. 
66, 125\u2013132 (2010).","DOI":"10.1107\/S0907444909047337"},{"key":"BFsrep45389_CR43","doi-asserted-by":"publisher","first-page":"1204","DOI":"10.1107\/S0907444913000061","volume":"69","author":"PR Evans","year":"2013","unstructured":"Evans, P. R. & Murshudov, G. N. How good are my data and what is the resolution? Acta Crystallogr. D. 69, 1204\u20131214 (2013).","journal-title":"Acta Crystallogr. D."},{"key":"BFsrep45389_CR44","doi-asserted-by":"crossref","unstructured":"Collaborative, Computational Project and others. The CCP4 suite: programs for protein crystallography. Acta Crystallogr. D. 50, 760 (1994).","DOI":"10.1107\/S0907444994003112"},{"key":"BFsrep45389_CR45","doi-asserted-by":"publisher","first-page":"658","DOI":"10.1107\/S0021889807021206","volume":"40","author":"AJ McCoy","year":"2007","unstructured":"McCoy, A. J. et al. Phasercrystallographic software. J. Appl. Crystallogr. 40, 658\u2013674 (2007).","journal-title":"J. Appl. Crystallogr"},{"key":"BFsrep45389_CR46","doi-asserted-by":"publisher","first-page":"486","DOI":"10.1107\/S0907444910007493","volume":"66","author":"P Emsley","year":"2010","unstructured":"Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. Features and development of Coot. Acta Crystallogr. D. 66, 486\u2013501 (2010).","journal-title":"Acta Crystallogr. D."},{"key":"BFsrep45389_CR47","doi-asserted-by":"publisher","first-page":"213","DOI":"10.1107\/S0907444909052925","volume":"66","author":"PD Adams","year":"2010","unstructured":"Adams, P. D. et al. PHENIX : a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr. D. 66, 213\u2013221 (2010).","journal-title":"Acta Crystallogr. D."},{"key":"BFsrep45389_CR48","doi-asserted-by":"publisher","first-page":"622","DOI":"10.1107\/S0021889893002729","volume":"26","author":"B Howlin","year":"1993","unstructured":"Howlin, B., Butler, S. A., Moss, D. S., Harris, G. W. & Driessen, H. P. C. TLSANL: TLS parameter-analysis program for segmented anisotropic refinement of macromolecular structures. J. Appl. Crystallogr. 26, 622\u2013624 (1993).","journal-title":"J. Appl. Crystallogr"},{"key":"BFsrep45389_CR49","doi-asserted-by":"publisher","first-page":"1002","DOI":"10.1107\/S0907444906022116","volume":"62","author":"K Cowtan","year":"2006","unstructured":"Cowtan, K. The Buccaneer software for automated model building. 1. Tracing protein chains. Acta Crystallogr. D. 62, 1002\u20131011 (2006).","journal-title":"Acta Crystallogr. D."},{"key":"BFsrep45389_CR50","doi-asserted-by":"publisher","first-page":"355","DOI":"10.1107\/S0907444911001314","volume":"67","author":"GN Murshudov","year":"2011","unstructured":"Murshudov, G. N. et al. REFMAC 5 for the refinement of macromolecular crystal structures. Acta Crystallogr. D. 67, 355\u2013367 (2011).","journal-title":"Acta Crystallogr. D."},{"key":"BFsrep45389_CR51","doi-asserted-by":"publisher","first-page":"646","DOI":"10.1107\/S1399004714028132","volume":"71","author":"PFEM Afonine","year":"2015","unstructured":"Afonine, P. FEM. : Feature Enhanced Map. Acta Crystallogr. D. 71, 646\u2013666 (2015).","journal-title":"Acta Crystallogr. D."},{"key":"BFsrep45389_CR52","doi-asserted-by":"publisher","first-page":"12","DOI":"10.1107\/S0907444909042073","volume":"66","author":"VB Chen","year":"2010","unstructured":"Chen, V. B. et al. MolProbity : all-atom structure validation for macromolecular crystallography. Acta Crystallogr. D. 66, 12\u201321 (2010).","journal-title":"Acta Crystallogr. 
D."},{"key":"BFsrep45389_CR53","unstructured":"Schr\u00f6dinger, LLC. The PyMOL Molecular Graphics System, Version 1.8 (2015)."},{"key":"BFsrep45389_CR54","doi-asserted-by":"publisher","first-page":"535","DOI":"10.1016\/S0022-2836(77)80200-3","volume":"112","author":"FC Bernstein","year":"1977","unstructured":"Bernstein, F. C. et al. The protein data bank: A computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535\u2013542 (1977).","journal-title":"J. Mol. Biol."},{"key":"BFsrep45389_CR55","doi-asserted-by":"publisher","first-page":"2295","DOI":"10.1093\/nar\/gkn072","volume":"36","author":"J Pei","year":"2008","unstructured":"Pei, J., Kim, B.-H. & Grishin, N. V. PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res. 36, 2295\u20132300 (2008).","journal-title":"Nucleic Acids Res"},{"key":"BFsrep45389_CR56","doi-asserted-by":"publisher","first-page":"3084","DOI":"10.1021\/ct400341p","volume":"9","author":"DR Roe","year":"2013","unstructured":"Roe, D. R. & Cheatham, T. E. PTRAJ and CPPTRAJ: Software for Processing and Analysis of Molecular Dynamics Trajectory Data. J. Chem. Theory Comput. 9, 3084\u20133095 (2013).","journal-title":"J. Chem. Theory Comput."},{"key":"BFsrep45389_CR57","doi-asserted-by":"publisher","first-page":"1668","DOI":"10.1002\/jcc.20290","volume":"26","author":"DA Case","year":"2005","unstructured":"Case, D. A. et al. The Amber biomolecular simulation programs. J. Comput. Chem. 26, 1668\u20131688 (2005).","journal-title":"J. Comput. Chem."},{"key":"BFsrep45389_CR58","doi-asserted-by":"publisher","first-page":"3543","DOI":"10.1021\/jp4125099","volume":"118","author":"DR Roe","year":"2014","unstructured":"Roe, D. R., Bergonzo, C. & Cheatham, T. E. Evaluation of Enhanced Sampling Provided by Accelerated Molecular Dynamics with Hamiltonian Replica Exchange Methods. J. Phys. Chem. B 118, 3543\u20133552 (2014).","journal-title":"J. Phys. Chem. B"},{"key":"BFsrep45389_CR59","doi-asserted-by":"publisher","first-page":"1041","DOI":"10.1016\/j.bbagen.2014.09.007","volume":"1850","author":"R Galindo-Murillo","year":"2015","unstructured":"Galindo-Murillo, R., Roe, D. R. & Cheatham, T. E. Convergence and reproducibility in molecular dynamics simulations of the DNA duplex d(gcacgaacgaacgaacgc). Biochim. Biophys. Acta 1850, 1041\u20131058 (2015).","journal-title":"Biochim. Biophys. Acta"},{"key":"BFsrep45389_CR60","doi-asserted-by":"publisher","first-page":"89","DOI":"10.1007\/978-1-59745-177-2_5","volume":"443","author":"S Hayward","year":"2008","unstructured":"Hayward, S. & Groot, B. L. Normal Modes and Essential Dynamics. Molecular Modeling of Proteins 443, 89\u2013106 (2008).","journal-title":"Molecular Modeling of Proteins"},{"key":"BFsrep45389_CR61","doi-asserted-by":"publisher","first-page":"3341","DOI":"10.1002\/prot.22841","volume":"78","author":"A Ahmed","year":"2010","unstructured":"Ahmed, A., Villinger, S. & Gohlke, H. Large-scale comparison of protein essential dynamics from molecular dynamics simulations and coarse-grained normal mode analyses. Proteins 78, 3341\u20133352 (2010).","journal-title":"Proteins"},{"key":"BFsrep45389_CR62","doi-asserted-by":"publisher","first-page":"3059","DOI":"10.1093\/nar\/gkf436","volume":"30","author":"K Katoh","year":"2002","unstructured":"Katoh, K. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 
30, 3059\u20133066 (2002).","journal-title":"Nucleic Acids Res"},{"key":"BFsrep45389_CR63","doi-asserted-by":"publisher","first-page":"5.6.1","DOI":"10.1002\/0471250953.bi0506s47","volume":"54","author":"B Webb","year":"2014","unstructured":"Webb, B. & Sali, A. Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics 54, 5.6.1\u20135.6.32 (2014).","journal-title":"Current Protocols in Bioinformatics"},{"key":"BFsrep45389_CR64","doi-asserted-by":"publisher","first-page":"1735","DOI":"10.1006\/jmbi.1998.2401","volume":"285","author":"J Word","year":"1999","unstructured":"Word, J., Lovell, S. C., Richardson, J. S. & Richardson, D. C. Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J. Mol. Biol. 285, 1735\u20131747 (1999).","journal-title":"J. Mol. Biol."},{"key":"BFsrep45389_CR65","doi-asserted-by":"publisher","first-page":"926","DOI":"10.1063\/1.445869","volume":"79","author":"WL Jorgensen","year":"1983","unstructured":"Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W. & Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926 (1983).","journal-title":"J. Chem. Phys."},{"key":"BFsrep45389_CR66","doi-asserted-by":"crossref","first-page":"712","DOI":"10.1002\/prot.21123","volume":"65","author":"V Hornak","year":"2006","unstructured":"Hornak, V. et al. Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins 65, 712\u2013725 (2006).","journal-title":"Proteins"},{"key":"BFsrep45389_CR67","doi-asserted-by":"publisher","first-page":"281","DOI":"10.1007\/s00894-005-0028-4","volume":"12","author":"N Homeyer","year":"2005","unstructured":"Homeyer, N., Horn, A. H. C., Lanig, H. & Sticht, H. AMBER force-field parameters for phosphorylated amino acids in different protonation states: phosphoserine, phosphothreonine, phosphotyrosine, and phosphohistidine. J. Mol. Model. 12, 281\u2013289 (2005).","journal-title":"J. Mol. Model."},{"key":"BFsrep45389_CR68","doi-asserted-by":"publisher","first-page":"327","DOI":"10.1016\/0021-9991(77)90098-5","volume":"23","author":"J-P Ryckaert","year":"1977","unstructured":"Ryckaert, J.-P., Ciccotti, G. & Berendsen, H. J. Numerical integration of the cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J. Comput. Phys. 23, 327\u2013341 (1977).","journal-title":"J. Comput. Phys."},{"key":"BFsrep45389_CR69","doi-asserted-by":"publisher","first-page":"4193","DOI":"10.1021\/ja00119a045","volume":"117","author":"TEI Cheatham","year":"1995","unstructured":"Cheatham, T. E. I., Miller, J. L., Fox, T., Darden, T. A. & Kollman, P. A. Molecular Dynamics Simulations on Solvated Biomolecular Systems: The Particle Mesh Ewald Method Leads to Stable Trajectories of DNA, RNA, and Proteins. J. Am. Chem. Soc. 117, 4193\u20134194 (1995).","journal-title":"J. Am. Chem. Soc."},{"key":"BFsrep45389_CR70","doi-asserted-by":"crossref","first-page":"3684","DOI":"10.1063\/1.448118","volume":"81","author":"HJC Berendsen","year":"1984","unstructured":"Berendsen, H. J. C., Postma, J. P. M., van Gunsteren, W. F., DiNola, A. & Haak, J. R. Molecular dynamics with coupling to an external bath. J. Chem. Phys. 81, 3684 (1984).","journal-title":"J. Chem. Phys."},{"key":"BFsrep45389_CR71","doi-asserted-by":"publisher","first-page":"150","DOI":"10.1002\/prot.1081","volume":"44","author":"DJ Jacobs","year":"2001","unstructured":"Jacobs, D. J., Rader, A. J., Kuhn, L. A. 
& Thorpe, M. F. Protein flexibility predictions using graph theory. Proteins 44, 150\u2013165 (2001).","journal-title":"Proteins"},{"key":"BFsrep45389_CR72","doi-asserted-by":"publisher","first-page":"187","DOI":"10.1016\/0021-9991(77)90121-8","volume":"23","author":"G Torrie","year":"1977","unstructured":"Torrie, G. & Valleau, J. Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. J. Comput. Phys. 23, 187\u2013199 (1977).","journal-title":"J. Comput. Phys."},{"key":"BFsrep45389_CR73","doi-asserted-by":"publisher","first-page":"1011","DOI":"10.1002\/jcc.540130812","volume":"13","author":"S Kumar","year":"1992","unstructured":"Kumar, S., Rosenberg, J. M., Bouzida, D., Swendsen, R. H. & Kollman, P. A. THE weighted histogram analysis method for free-energy calculations on biomolecules. I. The method. J. Comput. Chem. 13, 1011\u20131021 (1992).","journal-title":"J. Comput. Chem."}],"container-title":["Scientific Reports"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.nature.com\/articles\/srep45389.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/www.nature.com\/articles\/srep45389","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"http:\/\/www.nature.com\/doifinder\/10.1038\/srep45389","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"},{"URL":"https:\/\/www.nature.com\/articles\/srep45389.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2022,12,23]],"date-time":"2022-12-23T23:51:27Z","timestamp":1671839487000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.nature.com\/articles\/srep45389"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2017,3,30]]},"references-count":73,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2017,6,19]]}},"alternative-id":["BFsrep45389"],"URL":"https:\/\/doi.org\/10.1038\/srep45389","relation":{},"ISSN":["2045-2322"],"issn-type":[{"value":"2045-2322","type":"electronic"}],"subject":[],"published":{"date-parts":[[2017,3,30]]},"assertion":[{"value":"8 December 2016","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"23 February 2017","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"30 March 2017","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare no competing financial interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"45389"}}
diff --git a/tests/srep45389.json b/tests/srep45389_openalex.json
similarity index 100%
rename from tests/srep45389.json
rename to tests/srep45389_openalex.json
diff --git a/tests/test_abstract_processor.py b/tests/test_abstract_processor.py
new file mode 100644
index 0000000..16971f8
--- /dev/null
+++ b/tests/test_abstract_processor.py
@@ -0,0 +1,376 @@
+from unittest.mock import patch
+
+import pytest
+
+from doi2dataset import DERIVATIVE_ALLOWED_LICENSES, LICENSE_MAP, License
+from doi2dataset.api.client import APIClient
+from doi2dataset.api.processors import AbstractProcessor
+
+
+def create_license_from_map(license_short: str) -> License:
+ """Helper function to create License objects from LICENSE_MAP"""
+ if license_short in LICENSE_MAP:
+ uri, name = LICENSE_MAP[license_short]
+ return License(name=name, uri=uri, short=license_short)
+ else:
+ # For unknown licenses not in the map
+ return License(name="Unknown License", uri="", short=license_short)
+
+
+class TestAbstractProcessor:
+ """Test cases for AbstractProcessor derivative license logic"""
+
+ def setup_method(self):
+ """Setup test fixtures"""
+ self.api_client = APIClient()
+ self.processor = AbstractProcessor(self.api_client)
+
+ def test_derivative_allowed_license_uses_crossref(self):
+ """Test that licenses allowing derivatives attempt CrossRef first"""
+ # Create a license that allows derivatives using LICENSE_MAP
+ license_obj = create_license_from_map("cc-by")
+
+        # Mock the CrossRef method to return an abstract and silence console output
+ with patch.object(
+ self.processor,
+ "_get_crossref_abstract",
+ return_value="CrossRef abstract text",
+ ) as mock_crossref:
+ with patch.object(
+ self.processor, "_get_openalex_abstract"
+ ) as mock_openalex:
+ with patch.object(self.processor.console, "print") as _mock_print:
+ result = self.processor.get_abstract(
+ "10.1234/test", {}, license_obj
+ )
+
+ # Should call CrossRef and get result
+ mock_crossref.assert_called_once_with("10.1234/test")
+ mock_openalex.assert_not_called()
+ assert result.text == "CrossRef abstract text"
+ assert result.source == "crossref"
+
+ def test_derivative_not_allowed_license_uses_openalex(self):
+ """Test that licenses not allowing derivatives use OpenAlex reconstruction"""
+ # Create a license that does not allow derivatives using LICENSE_MAP
+ license_obj = create_license_from_map("cc-by-nd")
+
+ # Mock the OpenAlex method to return an abstract
+ with patch.object(self.processor, "_get_crossref_abstract") as mock_crossref:
+ with patch.object(
+ self.processor,
+ "_get_openalex_abstract",
+ return_value="OpenAlex reconstructed text",
+ ) as mock_openalex:
+ with patch.object(self.processor.console, "print") as _mock_print:
+ result = self.processor.get_abstract(
+ "10.1234/test", {}, license_obj
+ )
+
+ # Should skip CrossRef and use OpenAlex
+ mock_crossref.assert_not_called()
+ mock_openalex.assert_called_once_with({})
+ assert result.text == "OpenAlex reconstructed text"
+ assert result.source == "openalex"
+
+ def test_unknown_license_uses_openalex(self):
+ """Test that unknown licenses default to OpenAlex reconstruction"""
+ # Create an unknown license (not in LICENSE_MAP)
+ license_obj = create_license_from_map("unknown-license")
+
+ # Mock the OpenAlex method to return an abstract
+ with patch.object(self.processor, "_get_crossref_abstract") as mock_crossref:
+ with patch.object(
+ self.processor,
+ "_get_openalex_abstract",
+ return_value="OpenAlex reconstructed text",
+ ) as mock_openalex:
+ with patch.object(self.processor.console, "print") as _mock_print:
+ result = self.processor.get_abstract(
+ "10.1234/test", {}, license_obj
+ )
+
+ # Should skip CrossRef and use OpenAlex
+ mock_crossref.assert_not_called()
+ mock_openalex.assert_called_once_with({})
+ assert result.text == "OpenAlex reconstructed text"
+ assert result.source == "openalex"
+
+ def test_crossref_fallback_to_openalex(self):
+ """Test fallback to OpenAlex when CrossRef returns no abstract"""
+ # Create a license that allows derivatives using LICENSE_MAP
+ license_obj = create_license_from_map("cc-by")
+
+ # Mock CrossRef to return None (no abstract found)
+ with patch.object(
+ self.processor, "_get_crossref_abstract", return_value=None
+ ) as mock_crossref:
+ with patch.object(
+ self.processor,
+ "_get_openalex_abstract",
+ return_value="OpenAlex fallback text",
+ ) as mock_openalex:
+ with patch.object(self.processor.console, "print") as _mock_print:
+ result = self.processor.get_abstract(
+ "10.1234/test", {}, license_obj
+ )
+
+ # Should try CrossRef first, then fall back to OpenAlex
+ mock_crossref.assert_called_once_with("10.1234/test")
+ mock_openalex.assert_called_once_with({})
+ assert result.text == "OpenAlex fallback text"
+ assert result.source == "openalex"
+
+ def test_no_abstract_found_anywhere(self):
+ """Test when no abstract is found in either source"""
+ # Create a license that allows derivatives using LICENSE_MAP
+ license_obj = create_license_from_map("cc-by")
+
+ # Mock both methods to return None
+ with patch.object(
+ self.processor, "_get_crossref_abstract", return_value=None
+ ) as mock_crossref:
+ with patch.object(
+ self.processor, "_get_openalex_abstract", return_value=None
+ ) as mock_openalex:
+ with patch.object(self.processor.console, "print") as _mock_print:
+ result = self.processor.get_abstract(
+ "10.1234/test", {}, license_obj
+ )
+
+ # Should try both sources
+ mock_crossref.assert_called_once_with("10.1234/test")
+ mock_openalex.assert_called_once_with({})
+ assert result.text == ""
+ assert result.source == "none"
+
+ @pytest.mark.parametrize("license_short", DERIVATIVE_ALLOWED_LICENSES)
+ def test_all_derivative_allowed_licenses_use_crossref_first(self, license_short):
+ """Test that all licenses in DERIVATIVE_ALLOWED_LICENSES use CrossRef first"""
+ # Create license using LICENSE_MAP data
+ license_obj = create_license_from_map(license_short)
+
+ with patch.object(
+ self.processor, "_get_crossref_abstract", return_value="CrossRef text"
+ ) as mock_crossref:
+ with patch.object(
+ self.processor, "_get_openalex_abstract"
+ ) as mock_openalex:
+ with patch.object(self.processor.console, "print") as _mock_print:
+ result = self.processor.get_abstract(
+ "10.1234/test", {}, license_obj
+ )
+
+ # Should use CrossRef for all derivative-allowed licenses
+ mock_crossref.assert_called_once()
+ mock_openalex.assert_not_called()
+ assert result.source == "crossref"
+
+ def test_derivative_allowed_licenses_set_matches_usage(self):
+ """Test that DERIVATIVE_ALLOWED_LICENSES set is correctly used in logic"""
+ # This is a meta-test to ensure the constant is used correctly
+
+ # Test a license that should allow derivatives using LICENSE_MAP
+ allowed_license = create_license_from_map("cc-by")
+ assert allowed_license.short in DERIVATIVE_ALLOWED_LICENSES
+
+ # Test a license that should not allow derivatives using LICENSE_MAP
+ not_allowed_license = create_license_from_map("cc-by-nd")
+ assert not_allowed_license.short not in DERIVATIVE_ALLOWED_LICENSES
+
+ # Test that the processor logic matches the set
+ with patch.object(
+ self.processor, "_get_crossref_abstract", return_value="CrossRef"
+ ) as mock_crossref:
+ with patch.object(
+ self.processor, "_get_openalex_abstract", return_value="OpenAlex"
+ ) as mock_openalex:
+ with patch.object(self.processor.console, "print") as _mock_print:
+ # Allowed license should use CrossRef
+ result1 = self.processor.get_abstract(
+ "10.1234/test", {}, allowed_license
+ )
+ assert mock_crossref.call_count == 1
+ assert result1.source == "crossref"
+
+ # Reset mocks
+ mock_crossref.reset_mock()
+ mock_openalex.reset_mock()
+
+ # Not allowed license should skip CrossRef
+ result2 = self.processor.get_abstract(
+ "10.1234/test", {}, not_allowed_license
+ )
+ mock_crossref.assert_not_called()
+ mock_openalex.assert_called_once()
+ assert result2.source == "openalex"
+
+ def test_custom_license_console_output(self):
+ """Test console output for custom licenses without names"""
+ # Create a custom license without a name
+ custom_license = License(name="", uri="http://custom.license", short="custom")
+
+ with patch.object(
+ self.processor, "_get_openalex_abstract", return_value="OpenAlex text"
+ ):
+ with patch.object(self.processor.console, "print") as mock_print:
+ result = self.processor.get_abstract("10.1234/test", {}, custom_license)
+
+ # Should print custom license message
+ mock_print.assert_called()
+ # Check that it mentions "Custom license"
+ call_args = mock_print.call_args[0][0]
+ assert "Custom license does not allow derivative works" in call_args
+ assert result.source == "openalex"
+
+ def test_crossref_api_failure(self):
+ """Test _get_crossref_abstract when API call fails"""
+ from unittest.mock import Mock
+
+ # Mock API response failure
+ mock_response = Mock()
+ mock_response.status_code = 404
+
+ with patch.object(
+ self.processor.api_client, "make_request", return_value=mock_response
+ ):
+ result = self.processor._get_crossref_abstract("10.1234/test")
+ assert result is None
+
+ # Test with no response
+ with patch.object(self.processor.api_client, "make_request", return_value=None):
+ result = self.processor._get_crossref_abstract("10.1234/test")
+ assert result is None
+
+ def test_get_openalex_abstract_no_inverted_index(self):
+ """Test _get_openalex_abstract when no abstract_inverted_index exists"""
+ data = {"title": "Test Article"} # No abstract_inverted_index
+
+ result = self.processor._get_openalex_abstract(data)
+ assert result is None
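+
+    def test_get_openalex_abstract_minimal_inverted_index(self):
+        """Hedged sketch: rebuild an abstract from a minimal inverted index.
+
+        OpenAlex stores abstracts as {word: [positions]}; the processor is
+        expected to order words by position and space-join them. The exact
+        expected string assumes no extra normalization is applied.
+        """
+        data = {"abstract_inverted_index": {"Hello": [0], "world": [1]}}
+
+        result = self.processor._get_openalex_abstract(data)
+
+        assert result == "Hello world"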
+
+ def test_clean_jats_comprehensive(self):
+ """Test _clean_jats method with various JATS tags"""
+ # Test with None input
+ result = self.processor._clean_jats(None)
+ assert result == ""
+
+ # Test with empty string
+ result = self.processor._clean_jats("")
+ assert result == ""
+
+ # Test with ordered list
+        jats_text = (
+            '<jats:list list-type="order">'
+            "<jats:list-item>First item</jats:list-item>"
+            "<jats:list-item>Second item</jats:list-item>"
+            "</jats:list>"
+        )
+        # Markup reconstructed; assumes ordered lists map to <ol>/<li>
+        expected = "<ol><li>First item</li><li>Second item</li></ol>"
+ result = self.processor._clean_jats(jats_text)
+ assert result == expected
+
+ # Test with unordered list
+        jats_text = (
+            '<jats:list list-type="bullet">'
+            "<jats:list-item>Bullet one</jats:list-item>"
+            "<jats:list-item>Bullet two</jats:list-item>"
+            "</jats:list>"
+        )
+        # Markup reconstructed; assumes bullet lists map to <ul>/<li>
+        expected = "<ul><li>Bullet one</li><li>Bullet two</li></ul>"
+ result = self.processor._clean_jats(jats_text)
+ assert result == expected
+
+ # Test with mixed formatting tags
+ jats_text = "This is italic and bold text with superscript and subscript."
+ expected = "This is italic and bold text with superscript and subscript.
"
+ result = self.processor._clean_jats(jats_text)
+ assert result == expected
+
+ # Test with other formatting tags
+ jats_text = "Underlined Code Small caps"
+ expected = "Underlined Code
Small caps"
+ result = self.processor._clean_jats(jats_text)
+ assert result == expected
+
+ # Test with title and blockquote
+ jats_text = "Section TitleThis is a quote"
+ expected = "Section Title
This is a quote
"
+ result = self.processor._clean_jats(jats_text)
+ assert result == expected
+
+ def test_no_abstract_found_console_messages(self):
+ """Test console messages when no abstract is found"""
+ license_obj = create_license_from_map("cc-by-nd") # No derivative allowed
+
+ with patch.object(self.processor, "_get_openalex_abstract", return_value=None):
+ with patch.object(self.processor.console, "print") as mock_print:
+ result = self.processor.get_abstract("10.1234/test", {}, license_obj)
+
+ # Should print warning messages
+ assert mock_print.call_count >= 2
+
+ # Check for specific warning messages
+ call_messages = [call[0][0] for call in mock_print.call_args_list]
+ assert any(
+ "No abstract found in OpenAlex!" in msg for msg in call_messages
+ )
+ assert any(
+ "No abstract found in either CrossRef nor OpenAlex!" in msg
+ for msg in call_messages
+ )
+
+ assert result.text == ""
+ assert result.source == "none"
+
+ def test_crossref_abstract_with_real_data(self, crossref_data):
+ """Test CrossRef abstract extraction using real CrossRef data"""
+ from http import HTTPStatus
+ from unittest.mock import Mock
+
+ # Mock successful API response with real data
+ mock_response = Mock()
+ mock_response.status_code = HTTPStatus.OK
+ mock_response.json.return_value = crossref_data
+
+        # Use the DOI from the CrossRef fixture so it matches the mocked response
+ expected_doi = crossref_data["message"]["DOI"]
+
+ with patch.object(
+ self.processor.api_client, "make_request", return_value=mock_response
+ ):
+ result = self.processor._get_crossref_abstract(expected_doi)
+
+ # Should successfully extract and clean the abstract
+ assert result is not None
+ assert len(result) > 0
+
+ # Check that JATS tags were converted to HTML
+ assert "" in result # JATS paragraphs converted
+ assert "" in result # JATS italic converted
+ assert "" in result # JATS subscript converted
+ assert "jats:" not in result # No JATS tags should remain
+
+ def test_jats_cleaning_comprehensive_real_data(self, crossref_data):
+ """Test JATS cleaning with real CrossRef abstract data"""
+
+ raw_abstract = crossref_data["message"]["abstract"]
+
+ # Clean the JATS tags
+ cleaned = self.processor._clean_jats(raw_abstract)
+
+ # Verify specific transformations from the real data
+ assert "" not in cleaned
+ assert "" in cleaned # Title should be converted
+ assert "" not in cleaned
+ assert "" in cleaned # Paragraphs should be converted
+ assert "" not in cleaned
+ assert "" in cleaned # Subscripts should be converted
+ assert "" not in cleaned
+ assert "" in cleaned # Italics should be converted
+
+ # Ensure the content is preserved by checking for specific content from the abstract
+ assert "pyruvate phosphate dikinase" in cleaned.lower()
+ assert "Abstract" in cleaned
+
+ def test_openalex_abstract_reconstruction_with_real_data(self, openalex_data):
+ """Test OpenAlex abstract reconstruction using real inverted index data"""
+
+ # Extract the abstract using the inverted index
+ result = self.processor._get_openalex_abstract(openalex_data)
+
+ if result: # Only test if there's an abstract in the data
+ assert isinstance(result, str)
+ assert len(result) > 0
+ # Should be reconstructed from word positions
+ assert " " in result # Should have spaces between words
diff --git a/tests/test_api_client.py b/tests/test_api_client.py
new file mode 100644
index 0000000..aea461a
--- /dev/null
+++ b/tests/test_api_client.py
@@ -0,0 +1,528 @@
+"""
+Tests for the API client module.
+
+Tests for error handling, network failures, authentication, and edge cases.
+"""
+
+import json
+from unittest.mock import Mock, patch
+
+import pytest
+import requests
+
+from doi2dataset.api.client import APIClient
+
+
+class TestAPIClientInitialization:
+ """Test API client initialization and header configuration."""
+
+ def test_init_default_params(self):
+ """Test initialization with default parameters."""
+ client = APIClient()
+
+ assert client.session is not None
+ assert "User-Agent" in client.session.headers
+ assert client.session.headers["User-Agent"] == "doi2dataset/2.0"
+
+ def test_init_with_contact_mail(self):
+ """Test initialization with contact email."""
+ client = APIClient(contact_mail="test@example.com")
+
+ expected_ua = "doi2dataset/2.0 (mailto:test@example.com)"
+ assert client.session.headers["User-Agent"] == expected_ua
+
+ def test_init_with_custom_user_agent(self):
+ """Test initialization with custom user agent."""
+ client = APIClient(user_agent="custom-agent/1.0")
+
+ assert client.session.headers["User-Agent"] == "custom-agent/1.0"
+
+ def test_init_with_token(self):
+ """Test initialization with API token."""
+ client = APIClient(token="test-token-123")
+
+ assert client.session.headers["X-Dataverse-key"] == "test-token-123"
+
+ def test_init_with_all_params(self):
+ """Test initialization with all parameters."""
+ client = APIClient(
+ contact_mail="test@example.com", user_agent="custom/1.0", token="token-123"
+ )
+
+ assert "mailto:test@example.com" in client.session.headers["User-Agent"]
+ assert client.session.headers["X-Dataverse-key"] == "token-123"
+
+
+class TestAPIClientRequests:
+ """Test API client request handling."""
+
+ def test_make_request_success(self):
+ """Test successful GET request."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 200
+ mock_response.json.return_value = {"success": True}
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.example.com/test")
+
+ assert response == mock_response
+ mock_request.assert_called_once_with("GET", "https://api.example.com/test")
+
+ def test_make_request_post_with_data(self):
+ """Test POST request with JSON data."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 201
+ mock_request.return_value = mock_response
+
+ test_data = {"key": "value"}
+ response = client.make_request(
+ "https://api.example.com/create", method="POST", json=test_data
+ )
+
+ assert response == mock_response
+ mock_request.assert_called_once_with(
+ "POST", "https://api.example.com/create", json=test_data
+ )
+
+ def test_make_request_with_auth(self):
+ """Test request with authentication."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 200
+ mock_request.return_value = mock_response
+
+ auth = ("username", "password")
+ response = client.make_request("https://api.example.com/secure", auth=auth)
+
+ assert response == mock_response
+ mock_request.assert_called_once_with(
+ "GET", "https://api.example.com/secure", auth=auth
+ )
+
+
+class TestAPIClientErrorHandling:
+ """Test error handling scenarios."""
+
+ def test_connection_error_returns_none(self):
+ """Test that connection errors return None."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_request.side_effect = requests.exceptions.ConnectionError(
+ "Connection failed"
+ )
+
+ response = client.make_request("https://api.example.com/test")
+
+ assert response is None
+
+ def test_timeout_error_returns_none(self):
+ """Test that timeout errors return None."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_request.side_effect = requests.exceptions.Timeout("Request timed out")
+
+ response = client.make_request("https://api.example.com/test")
+
+ assert response is None
+
+ def test_http_error_returns_none(self):
+ """Test that HTTP errors return None."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.raise_for_status.side_effect = requests.exceptions.HTTPError(
+ "404 Not Found"
+ )
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.example.com/notfound")
+
+ assert response is None
+
+ def test_request_exception_returns_none(self):
+ """Test that general request exceptions return None."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_request.side_effect = requests.exceptions.RequestException(
+ "General error"
+ )
+
+ response = client.make_request("https://api.example.com/test")
+
+ assert response is None
+
+ def test_ssl_error_returns_none(self):
+ """Test that SSL errors return None."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_request.side_effect = requests.exceptions.SSLError(
+ "SSL verification failed"
+ )
+
+ response = client.make_request("https://api.example.com/test")
+
+ assert response is None
+
+ def test_too_many_redirects_returns_none(self):
+ """Test that redirect errors return None."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_request.side_effect = requests.exceptions.TooManyRedirects(
+ "Too many redirects"
+ )
+
+ response = client.make_request("https://api.example.com/test")
+
+ assert response is None
+
+
+class TestAPIClientStatusCodeHandling:
+ """Test handling of HTTP status codes."""
+
+ @pytest.mark.parametrize("status_code", [400, 401, 403, 404, 500, 502, 503])
+ def test_error_status_codes_return_none(self, status_code):
+ """Test that error status codes return None after raise_for_status."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = status_code
+ mock_response.raise_for_status.side_effect = requests.exceptions.HTTPError(
+ f"{status_code} Error"
+ )
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.example.com/test")
+
+ assert response is None
+
+ @pytest.mark.parametrize("status_code", [200, 201, 202, 204])
+ def test_success_status_codes_return_response(self, status_code):
+ """Test that success status codes return the response."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = status_code
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.example.com/test")
+
+ assert response == mock_response
+
+
+class TestAPIClientContextManager:
+ """Test context manager functionality."""
+
+ def test_context_manager_enter(self):
+ """Test context manager __enter__ method."""
+ client = APIClient()
+
+ with client as context_client:
+ assert context_client is client
+
+ def test_context_manager_exit_calls_close(self):
+ """Test context manager __exit__ calls close."""
+ client = APIClient()
+
+ with patch.object(client, "close") as mock_close:
+ with client:
+ pass
+ mock_close.assert_called_once()
+
+ def test_context_manager_exit_with_exception(self):
+ """Test context manager handles exceptions properly."""
+ client = APIClient()
+
+ with patch.object(client, "close") as mock_close:
+ try:
+ with client:
+ raise ValueError("Test exception")
+ except ValueError:
+ pass
+ mock_close.assert_called_once()
+
+ def test_close_method(self):
+ """Test the close method calls session.close."""
+ client = APIClient()
+
+ with patch.object(client.session, "close") as mock_close:
+ client.close()
+ mock_close.assert_called_once()
+
+
+class TestAPIClientUsageScenarios:
+ """Test usage scenarios."""
+
+ def test_openalex_api_call(self):
+ """Test OpenAlex API call."""
+ client = APIClient(contact_mail="test@university.edu")
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 200
+ mock_response.json.return_value = {
+ "id": "https://openalex.org/W123456789",
+ "title": "Test Paper",
+ "authors": [],
+ }
+ mock_request.return_value = mock_response
+
+ response = client.make_request(
+ "https://api.openalex.org/works/10.1000/test"
+ )
+
+ assert response is not None
+ assert response.json()["title"] == "Test Paper"
+
+ def test_dataverse_upload(self):
+ """Test Dataverse metadata upload."""
+ client = APIClient(token="dataverse-token-123")
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 201
+ mock_response.json.return_value = {
+ "status": "OK",
+ "data": {"persistentId": "doi:10.5072/FK2/ABC123"},
+ }
+ mock_request.return_value = mock_response
+
+ metadata = {"datasetVersion": {"files": []}}
+ response = client.make_request(
+ "https://demo.dataverse.org/api/dataverses/test/datasets",
+ method="POST",
+ json=metadata,
+ auth=("user", "pass"),
+ )
+
+ assert response is not None
+ assert "persistentId" in response.json()["data"]
+
+ def test_network_failure_fallback(self):
+ """Test fallback handling for network failures."""
+ client = APIClient()
+ urls_to_try = [
+ "https://primary-api.example.com/data",
+ "https://fallback-api.example.com/data",
+ ]
+
+ with patch.object(client.session, "request") as mock_request:
+ # First request fails, second succeeds
+ mock_request.side_effect = [
+ requests.exceptions.ConnectionError("Primary API down"),
+ Mock(status_code=200, json=lambda: {"source": "fallback"}),
+ ]
+
+ response = None
+ for url in urls_to_try:
+ response = client.make_request(url)
+ if response is not None:
+ break
+
+ assert response is not None
+ assert response.json()["source"] == "fallback"
+
+ def test_rate_limit_handling(self):
+ """Test handling of rate limit responses."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 429
+ mock_response.headers = {"Retry-After": "60"}
+ mock_response.raise_for_status.side_effect = requests.exceptions.HTTPError(
+ "429 Too Many Requests"
+ )
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.example.com/data")
+
+ # Should return None for rate limited responses
+ assert response is None
+
+ def test_malformed_json_response(self):
+ """Test handling of malformed JSON responses."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 200
+ mock_response.json.side_effect = json.JSONDecodeError("Invalid JSON", "", 0)
+ mock_response.text = "Invalid JSON response"
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.example.com/data")
+
+ # Should still return the response even if JSON parsing fails
+ assert response == mock_response
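+
+    def test_caller_guards_malformed_json(self):
+        """Hedged sketch of the caller-side pattern implied above: make_request
+        returns the raw response, so JSON decoding errors surface only at
+        .json() time and must be guarded by the caller."""
+        client = APIClient()
+
+        with patch.object(client.session, "request") as mock_request:
+            mock_response = Mock()
+            mock_response.status_code = 200
+            mock_response.json.side_effect = json.JSONDecodeError("Invalid JSON", "", 0)
+            mock_request.return_value = mock_response
+
+            response = client.make_request("https://api.example.com/data")
+
+            data = None
+            if response is not None:
+                try:
+                    data = response.json()
+                except json.JSONDecodeError:
+                    data = None
+            assert data is None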
+
+ def test_large_response(self):
+ """Test handling of large responses."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 200
+ # Simulate a large response
+ large_data = {"items": [{"id": i} for i in range(10000)]}
+ mock_response.json.return_value = large_data
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.example.com/large-dataset")
+
+ assert response is not None
+ assert len(response.json()["items"]) == 10000
+
+ def test_unicode_in_responses(self):
+ """Test handling of Unicode characters in responses."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 200
+ unicode_data = {
+ "title": "Étude sur les caractères spéciaux: αβγ, 中文, 日本語",
+ "author": "José María García-López",
+ }
+ mock_response.json.return_value = unicode_data
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.example.com/unicode-data")
+
+ assert response is not None
+ data = response.json()
+ assert "Étude" in data["title"]
+ assert "García" in data["author"]
+
+ def test_custom_headers_persist(self):
+ """Test custom headers are preserved across requests."""
+ client = APIClient(contact_mail="test@example.com", token="test-token")
+
+ # Add custom header
+ client.session.headers.update({"Custom-Header": "custom-value"})
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 200
+ mock_request.return_value = mock_response
+
+ client.make_request("https://api.example.com/test")
+
+ # Verify all headers are present
+ assert "User-Agent" in client.session.headers
+ assert "X-Dataverse-key" in client.session.headers
+ assert "Custom-Header" in client.session.headers
+ assert client.session.headers["Custom-Header"] == "custom-value"
+
+
+def test_api_response_structure_processing(openalex_data):
+ """Test API client processes complex nested response structures correctly."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 200
+ mock_response.json.return_value = openalex_data
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.openalex.org/works/test")
+
+ assert response is not None
+ data = response.json()
+
+ # Test that nested structures are preserved through the request pipeline
+ if "authorships" in data:
+ assert isinstance(data["authorships"], list)
+ # Test deep nesting preservation
+ for authorship in data["authorships"]:
+ if "institutions" in authorship:
+ assert isinstance(authorship["institutions"], list)
+
+ # Test data type preservation through JSON serialization/deserialization
+ for key, value in data.items():
+ assert value is not None or key in [
+ "abstract_inverted_index",
+ "abstract_inverted_index_v3",
+ ] # Some fields can legitimately be None
+
+
+def test_api_unicode_encoding_processing(openalex_data):
+ """Test API client correctly processes Unicode characters in responses."""
+ client = APIClient()
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 200
+ mock_response.json.return_value = openalex_data
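+ # The encoding attribute is set for realism; the mocked json() returns the parsed dict directly.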
+ mock_response.encoding = "utf-8"
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.openalex.org/works/test")
+
+ assert response is not None
+ data = response.json()
+
+ # Test that Unicode characters are preserved through processing pipeline
+ def check_unicode_preservation(obj):
+ if isinstance(obj, str):
+ # Should preserve Unicode characters
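+ # In Python 3 a str encodes to UTF-8 unless it contains lone surrogates, so this mainly flags surrogate artifacts from decoding.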
+ try:
+ obj.encode("utf-8")
+ return True
+ except UnicodeEncodeError:
+ return False
+ elif isinstance(obj, dict):
+ return all(check_unicode_preservation(v) for v in obj.values())
+ elif isinstance(obj, list):
+ return all(check_unicode_preservation(item) for item in obj)
+ return True
+
+ assert check_unicode_preservation(data)
+
+
+def test_large_response_processing_efficiency(openalex_data):
+ """Test API client efficiently processes large response payloads."""
+ client = APIClient()
+
+ # Create large response based on real structure
+ large_data = dict(openalex_data)
+ if "referenced_works" in large_data:
+ # Extend existing referenced works
+ base_works = (
+ large_data["referenced_works"][:10]
+ if large_data["referenced_works"]
+ else []
+ )
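+ # List multiplication repeats the same string references, producing a long list cheaply.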
+ large_data["referenced_works"] = base_works * 100 # Create large list
+
+ with patch.object(client.session, "request") as mock_request:
+ mock_response = Mock()
+ mock_response.status_code = 200
+ mock_response.json.return_value = large_data
+ mock_request.return_value = mock_response
+
+ response = client.make_request("https://api.openalex.org/works/test")
+
+ assert response is not None
+ data = response.json()
+
+ # Verify large data structures are handled correctly
+ if "referenced_works" in data:
+ assert len(data["referenced_works"]) > 100
+ # All elements should maintain structure integrity
+ assert all(isinstance(work, str) for work in data["referenced_works"])
diff --git a/tests/test_citation_builder.py b/tests/test_citation_builder.py
index 055e93e..75f53bc 100644
--- a/tests/test_citation_builder.py
+++ b/tests/test_citation_builder.py
@@ -1,18 +1,8 @@
-import json
-import os
-
import pytest
from doi2dataset import CitationBuilder, Person, PIFinder
-
-@pytest.fixture
-def openalex_data():
- """Load the saved JSON response from the file 'srep45389.json'"""
- json_path = os.path.join(os.path.dirname(__file__), "srep45389.json")
- with open(json_path, "r", encoding="utf-8") as f:
- data = json.load(f)
- return data
+# openalex_data fixture now comes from conftest.py
@pytest.fixture
@@ -23,7 +13,7 @@ def test_pi():
given_name="Author",
orcid="0000-0000-0000-1234",
email="test.author@example.org",
- affiliation="Test University"
+ affiliation="Test University",
)
@@ -115,7 +105,9 @@ def test_build_authors_with_ror(openalex_data, pi_finder):
pytest.skip("Test data doesn't contain any ROR identifiers")
# Create builder with ror=True to enable ROR identifiers
- builder = CitationBuilder(data=openalex_data, doi=doi, pi_finder=pi_finder, ror=True)
+ builder = CitationBuilder(
+ data=openalex_data, doi=doi, pi_finder=pi_finder, ror=True
+ )
# Get authors
authors, _ = builder.build_authors()
@@ -129,11 +121,11 @@ def test_build_authors_with_ror(openalex_data, pi_finder):
for author in authors:
# Check if author has affiliation
- if not hasattr(author, 'affiliation') or not author.affiliation:
+ if not hasattr(author, "affiliation") or not author.affiliation:
continue
# Check if affiliation is an Institution with a ROR ID
- if not hasattr(author.affiliation, 'ror'):
+ if not hasattr(author.affiliation, "ror"):
continue
# Check if ROR ID is present and contains "ror.org"
@@ -154,7 +146,7 @@ def test_build_authors_with_ror(openalex_data, pi_finder):
assert affiliation_field.value == institution_with_ror.ror
# Verify the expanded_value dictionary has the expected structure
- assert hasattr(affiliation_field, 'expanded_value')
+ assert hasattr(affiliation_field, "expanded_value")
assert isinstance(affiliation_field.expanded_value, dict)
# Check specific fields in the expanded_value
@@ -167,3 +159,121 @@ def test_build_authors_with_ror(openalex_data, pi_finder):
assert "@type" in expanded_value
assert expanded_value["@type"] == "https://schema.org/Organization"
+
+
+def test_build_authors_with_real_data(openalex_data, pi_finder):
+ """Test author building with real OpenAlex data structure"""
+ doi = openalex_data["doi"].replace("https://doi.org/", "")
+ builder = CitationBuilder(data=openalex_data, doi=doi, pi_finder=pi_finder)
+
+ authors, corresponding = builder.build_authors()
+
+ # Should have multiple authors from the real data
+ assert len(authors) > 0
+
+ # Extract expected author names from the API response data
+ expected_authors = []
+ for authorship in openalex_data.get("authorships", []):
+ if "author" in authorship and "display_name" in authorship["author"]:
+ expected_authors.append(authorship["author"]["display_name"])
+
+ # Check that real author names from API response are processed correctly
+ author_names = [f"{author.given_name} {author.family_name}" for author in authors]
+
+ # Verify that at least some expected authors from the API response are found
+ found_authors = 0
+ for expected_name in expected_authors:
+ if any(expected_name in author_name for author_name in author_names):
+ found_authors += 1
+
+ # Should find at least some authors from the API response
+ assert (
+ found_authors > 0
+ ), f"No expected authors found. Expected: {expected_authors}, Got: {author_names}"
+
+
+def test_process_author_edge_cases(pi_finder):
+ """Test _process_author with various edge cases"""
+ builder = CitationBuilder(
+ data={"authorships": []}, doi="10.1000/test", pi_finder=pi_finder
+ )
+
+ # Test with minimal author data
+ minimal_author = {"display_name": "John Smith"}
+ empty_authorship = {}
+ person = builder._process_author(minimal_author, empty_authorship)
+ assert person.given_name == "John"
+ assert person.family_name == "Smith"
+
+ # Test with ORCID
+ author_with_orcid = {
+ "display_name": "Jane Doe",
+ "orcid": "https://orcid.org/0000-0000-0000-0000",
+ }
+ person = builder._process_author(author_with_orcid, empty_authorship)
+ assert person.orcid == "0000-0000-0000-0000" # URL part is stripped
+
+
+def test_build_grants_with_default_config(pi_finder):
+ """Test that grants include default grants from config"""
+ import os
+
+ from doi2dataset import Config
+
+ # Load test config
+ config_path = os.path.join(os.path.dirname(__file__), "config_test.yaml")
+ Config.load_config(config_path=config_path)
+
+ # Use real data structure but focus on grants behavior
+ data = {"authorships": [], "grants": []}
+
+ builder = CitationBuilder(data=data, doi="10.1000/test", pi_finder=pi_finder)
+ grants = builder.build_grants()
+
+ # Should have at least the default grants from config
+ # The exact number depends on the config, but should be >= 0
+ assert isinstance(grants, list)
+ for grant in grants:
+ assert len(grant) == 2 # Should have agency and value fields
+ assert grant[0].name == "grantNumberAgency"
+ assert grant[1].name == "grantNumberValue"
+
+
+def test_process_corresponding_author_no_email(pi_finder):
+ """Test _process_corresponding_author when no email is available"""
+ builder = CitationBuilder(
+ data={"authorships": []}, doi="10.1000/test", pi_finder=pi_finder
+ )
+
+ # Create a Person without email
+ person = Person(
+ given_name="John", family_name="Doe", orcid=None, email=None, affiliation=None
+ )
+
+ authorship = {"is_corresponding": True}
+
+ result = builder._process_corresponding_author(person, authorship)
+
+ # Should return None when no email is available
+ assert result is None
+
+
+def test_build_authors_skip_empty_authorships(pi_finder):
+ """Test that empty author entries are skipped"""
+ data_with_empty_authors = {
+ "authorships": [
+ {"author": {}}, # Empty author
+ {}, # No author key
+ {"author": {"display_name": "John Doe"}}, # Valid author
+ ]
+ }
+
+ builder = CitationBuilder(
+ data=data_with_empty_authors, doi="10.1000/test", pi_finder=pi_finder
+ )
+ authors, corresponding = builder.build_authors()
+
+ # Should only process the one valid author
+ assert len(authors) == 1
+ assert authors[0].given_name == "John"
+ assert authors[0].family_name == "Doe"
diff --git a/tests/test_cli.py b/tests/test_cli.py
new file mode 100644
index 0000000..366eb1b
--- /dev/null
+++ b/tests/test_cli.py
@@ -0,0 +1,377 @@
+"""
+Tests for the CLI module.
+
+Tests for command-line argument parsing, error handling, and integration scenarios.
+"""
+
+import argparse
+import tempfile
+from io import StringIO
+from pathlib import Path
+from unittest.mock import Mock, patch
+
+from rich.console import Console
+from rich.theme import Theme
+
+from doi2dataset.cli import (
+ create_argument_parser,
+ main,
+ print_summary,
+ process_doi_batch,
+)
+
+
+class TestArgumentParser:
+ """Test argument parsing functionality."""
+
+ def test_create_argument_parser_basic(self):
+ """Test basic argument parser creation."""
+ parser = create_argument_parser()
+ assert isinstance(parser, argparse.ArgumentParser)
+ assert "Process DOIs to generate metadata" in parser.description
+
+ def test_parser_with_dois_only(self):
+ """Test parsing with DOI arguments only."""
+ parser = create_argument_parser()
+ args = parser.parse_args(["10.1000/test1", "10.1000/test2"])
+
+ assert args.dois == ["10.1000/test1", "10.1000/test2"]
+ assert args.file is None
+ assert args.output_dir == "."
+ assert args.depositor is None
+ assert args.subject == "Medicine, Health and Life Sciences"
+ assert args.contact_mail is False
+ assert args.upload is False
+ assert args.use_ror is False
+
+ def test_parser_with_file_option(self):
+ """Test parsing with file option."""
+ with tempfile.NamedTemporaryFile(mode="w", delete=False) as f:
+ f.write("10.1000/test1\n10.1000/test2\n")
+ f.flush()
+
+ parser = create_argument_parser()
+ args = parser.parse_args(["-f", f.name])
+
+ assert args.file is not None
+ assert args.file.name == f.name
+
+ def test_parser_with_all_options(self):
+ """Test parsing with all available options."""
+ parser = create_argument_parser()
+ args = parser.parse_args(
+ [
+ "10.1000/test",
+ "-o",
+ "/tmp/output",
+ "-d",
+ "John Doe",
+ "-s",
+ "Computer Science",
+ "-m",
+ "test@example.com",
+ "-u",
+ "-r",
+ ]
+ )
+
+ assert args.dois == ["10.1000/test"]
+ assert args.output_dir == "/tmp/output"
+ assert args.depositor == "John Doe"
+ assert args.subject == "Computer Science"
+ assert args.contact_mail == "test@example.com"
+ assert args.upload is True
+ assert args.use_ror is True
+
+ def test_parser_help_message(self):
+ """Test that help message is properly formatted."""
+ parser = create_argument_parser()
+ help_str = parser.format_help()
+
+ assert "Process DOIs to generate metadata" in help_str
+ assert "One or more DOIs to process" in help_str
+ assert "--file" in help_str
+ assert "--output-dir" in help_str
+
+
+class TestPrintSummary:
+ """Test the print_summary function."""
+
+ def test_print_summary_success_only(self):
+ """Test summary with only successful results."""
+ theme = Theme(
+ {"info": "cyan", "warning": "yellow", "error": "red", "success": "green"}
+ )
+ console = Console(file=StringIO(), width=80, theme=theme)
+ results = {"success": ["10.1000/test1", "10.1000/test2"], "failed": []}
+
+ print_summary(results, console)
+ output = console.file.getvalue()
+
+ assert "Success" in output
+ assert "2" in output
+ assert "10.1000/test1" in output
+
+ def test_print_summary_with_failures(self):
+ """Test summary with both success and failures."""
+ theme = Theme(
+ {"info": "cyan", "warning": "yellow", "error": "red", "success": "green"}
+ )
+ console = Console(file=StringIO(), width=80, theme=theme)
+ results = {
+ "success": ["10.1000/test1"],
+ "failed": [("10.1000/test2", "Connection error")],
+ }
+
+ print_summary(results, console)
+ output = console.file.getvalue()
+
+ assert "Success" in output
+ assert "Failed" in output
+ assert "1" in output
+ assert "10.1000/test2" in output
+
+ def test_print_summary_truncation(self):
+ """Test that long lists are properly truncated."""
+ theme = Theme(
+ {"info": "cyan", "warning": "yellow", "error": "red", "success": "green"}
+ )
+ console = Console(file=StringIO(), width=80, theme=theme)
+ results = {
+ "success": [f"10.1000/test{i}" for i in range(5)],
+ "failed": [(f"10.1000/fail{i}", "error") for i in range(5)],
+ }
+
+ print_summary(results, console)
+ output = console.file.getvalue()
+
+ assert "..." in output # Should show truncation
+
+
+class TestProcessDoiBatch:
+ """Test the process_doi_batch function."""
+
+ @patch("doi2dataset.cli.MetadataProcessor")
+ def test_process_doi_batch_success(self, mock_processor_class):
+ """Test successful batch processing."""
+ mock_processor = Mock()
+ mock_processor.process.return_value = None
+ mock_processor_class.return_value = mock_processor
+
+ theme = Theme(
+ {"info": "cyan", "warning": "yellow", "error": "red", "success": "green"}
+ )
+ console = Console(file=StringIO(), theme=theme)
+ output_dir = Path("/tmp/test")
+ dois = {"10.1000/test1", "10.1000/test2"}
+
+ results = process_doi_batch(dois=dois, output_dir=output_dir, console=console)
+
+ assert len(results["success"]) == 2
+ assert len(results["failed"]) == 0
+ assert mock_processor_class.call_count == 2
+
+ @patch("doi2dataset.cli.MetadataProcessor")
+ def test_process_doi_batch_with_failures(self, mock_processor_class):
+ """Test batch processing with some failures."""
+
+ def side_effect(*args, **kwargs):
+ # First call succeeds, second fails
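+ # Mock increments call_count before running side_effect, so it is already 1 on the first call.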
+ if mock_processor_class.call_count == 1:
+ mock = Mock()
+ mock.process.return_value = None
+ return mock
+ else:
+ mock = Mock()
+ mock.process.side_effect = ValueError("API Error")
+ return mock
+
+ mock_processor_class.side_effect = side_effect
+
+ theme = Theme(
+ {"info": "cyan", "warning": "yellow", "error": "red", "success": "green"}
+ )
+ console = Console(file=StringIO(), theme=theme)
+ output_dir = Path("/tmp/test")
+ dois = {"10.1000/test1", "10.1000/test2"}
+
+ results = process_doi_batch(dois=dois, output_dir=output_dir, console=console)
+
+ assert len(results["success"]) == 1
+ assert len(results["failed"]) == 1
+ assert "API Error" in results["failed"][0][1]
+
+ @patch("doi2dataset.cli.MetadataProcessor")
+ def test_process_doi_batch_with_upload(self, mock_processor_class):
+ """Test batch processing with upload flag."""
+ mock_processor = Mock()
+ mock_processor.process.return_value = None
+ mock_processor_class.return_value = mock_processor
+
+ theme = Theme(
+ {"info": "cyan", "warning": "yellow", "error": "red", "success": "green"}
+ )
+ console = Console(file=StringIO(), theme=theme)
+ output_dir = Path("/tmp/test")
+ dois = {"10.1000/test1"}
+
+ process_doi_batch(
+ dois=dois, output_dir=output_dir, upload=True, console=console
+ )
+
+ # Verify processor was called with upload=True
+ mock_processor_class.assert_called_once()
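+ # call_args[1] is the keyword-arguments dict (call_args.kwargs on Python 3.8+).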
+ call_kwargs = mock_processor_class.call_args[1]
+ assert call_kwargs["upload"] is True
+
+ @patch("doi2dataset.cli.sanitize_filename")
+ @patch("doi2dataset.cli.normalize_doi")
+ @patch("doi2dataset.cli.MetadataProcessor")
+ def test_process_doi_batch_filename_generation(
+ self, mock_processor_class, mock_normalize, mock_sanitize
+ ):
+ """Test that DOI filenames are properly generated."""
+ mock_normalize.return_value = "10.1000/test"
+ mock_sanitize.return_value = "10_1000_test"
+
+ mock_processor = Mock()
+ mock_processor.process.return_value = None
+ mock_processor_class.return_value = mock_processor
+
+ theme = Theme(
+ {"info": "cyan", "warning": "yellow", "error": "red", "success": "green"}
+ )
+ console = Console(file=StringIO(), theme=theme)
+ output_dir = Path("/tmp/test")
+ dois = {"10.1000/test"}
+
+ process_doi_batch(dois=dois, output_dir=output_dir, console=console)
+
+ mock_normalize.assert_called_once_with("10.1000/test")
+ mock_sanitize.assert_called_once_with("10.1000/test")
+
+ # Check that output path was constructed correctly
+ call_kwargs = mock_processor_class.call_args[1]
+ expected_path = output_dir / "10_1000_test_metadata.json"
+ assert call_kwargs["output_path"] == expected_path
+
+
+class TestMainFunction:
+ """Test the main CLI entry point."""
+
+ @patch("doi2dataset.cli.process_doi_batch")
+ @patch("sys.argv", ["doi2dataset", "10.1000/test"])
+ def test_main_with_doi_argument(self, mock_process):
+ """Test main function with DOI argument."""
+ mock_process.return_value = {"success": ["10.1000/test"], "failed": []}
+
+ with patch("sys.exit") as mock_exit:
+ main()
+ mock_exit.assert_not_called()
+ mock_process.assert_called_once()
+
+ @patch("sys.argv", ["doi2dataset"])
+ def test_main_no_arguments_exits(self):
+ """Test that main exits when no DOIs are provided."""
+ with patch("sys.exit") as mock_exit:
+ main()
+ mock_exit.assert_called_once_with(1)
+
+ @patch("doi2dataset.cli.validate_email_address")
+ @patch("sys.argv", ["doi2dataset", "10.1000/test", "-m", "invalid-email"])
+ def test_main_invalid_email_exits(self, mock_validate):
+ """Test main exits with invalid email."""
+ mock_validate.return_value = False
+
+ with patch("sys.exit") as mock_exit:
+ main()
+ mock_exit.assert_called_once_with(1)
+
+ @patch("doi2dataset.cli.validate_email_address")
+ @patch("doi2dataset.cli.process_doi_batch")
+ @patch("sys.argv", ["doi2dataset", "10.1000/test", "-m", "valid@example.com"])
+ def test_main_valid_email_continues(self, mock_process, mock_validate):
+ """Test main continues with valid email."""
+ mock_validate.return_value = True
+ mock_process.return_value = {"success": ["10.1000/test"], "failed": []}
+
+ with patch("sys.exit") as mock_exit:
+ main()
+ mock_exit.assert_not_called()
+
+ @patch("doi2dataset.cli.process_doi_batch")
+ def test_main_keyboard_interrupt(self, mock_process):
+ """Test main handles KeyboardInterrupt gracefully."""
+ mock_process.side_effect = KeyboardInterrupt()
+
+ with patch("sys.argv", ["doi2dataset", "10.1000/test"]):
+ with patch("sys.exit") as mock_exit:
+ main()
+ mock_exit.assert_called_once_with(1)
+
+ @patch("doi2dataset.cli.process_doi_batch")
+ def test_main_unexpected_error(self, mock_process):
+ """Test main handles unexpected errors gracefully."""
+ mock_process.side_effect = Exception("Unexpected error")
+
+ with patch("sys.argv", ["doi2dataset", "10.1000/test"]):
+ with patch("sys.exit") as mock_exit:
+ main()
+ mock_exit.assert_called_once_with(1)
+
+ @patch("doi2dataset.cli.process_doi_batch")
+ def test_main_output_directory_creation_failure(self, mock_process):
+ """Test main handles output directory creation failure."""
+ mock_process.return_value = {"success": [], "failed": []}
+
+ with patch("sys.argv", ["doi2dataset", "10.1000/test", "-o", "/invalid/path"]):
+ with patch(
+ "pathlib.Path.mkdir", side_effect=PermissionError("Permission denied")
+ ):
+ with patch("sys.exit") as mock_exit:
+ main()
+ mock_exit.assert_called_once_with(1)
+
+ def test_main_file_input_integration(self):
+ """Test main with file input."""
+ with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
+ f.write("10.1000/test1\n10.1000/test2\n\n# Comment line\n")
+ f.flush()
+
+ with patch("sys.argv", ["doi2dataset", "-f", f.name]):
+ with patch("doi2dataset.cli.process_doi_batch") as mock_process:
+ mock_process.return_value = {
+ "success": ["10.1000/test1", "10.1000/test2"],
+ "failed": [],
+ }
+ with patch("sys.exit") as mock_exit:
+ main()
+ mock_exit.assert_not_called()
+
+ # Verify DOIs were correctly parsed from file
+ call_args = mock_process.call_args[1]
+ dois = call_args["dois"]
+ assert "10.1000/test1" in dois
+ assert "10.1000/test2" in dois
+ # Note: Comment filtering happens in CLI main(), not in our mock
+
+ def test_main_combined_file_and_args_input(self):
+ """Test main with both file and argument DOIs."""
+ with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
+ f.write("10.1000/file1\n10.1000/file2\n")
+ f.flush()
+
+ with patch("sys.argv", ["doi2dataset", "10.1000/arg1", "-f", f.name]):
+ with patch("doi2dataset.cli.process_doi_batch") as mock_process:
+ mock_process.return_value = {"success": [], "failed": []}
+ with patch("sys.exit") as mock_exit:
+ main()
+ mock_exit.assert_not_called()
+
+ # Verify all DOIs were collected
+ call_args = mock_process.call_args[1]
+ dois = call_args["dois"]
+ assert "10.1000/arg1" in dois
+ assert "10.1000/file1" in dois
+ assert "10.1000/file2" in dois
+ assert len(dois) == 3
diff --git a/tests/test_doi2dataset.py b/tests/test_doi2dataset.py
deleted file mode 100644
index e5515d8..0000000
--- a/tests/test_doi2dataset.py
+++ /dev/null
@@ -1,38 +0,0 @@
-import os
-import sys
-
-sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
-
-from doi2dataset import NameProcessor, sanitize_filename, validate_email_address
-
-
-def test_sanitize_filename():
- """Test the sanitize_filename function to convert DOI to a valid filename."""
- doi = "10.1234/abc.def"
- expected = "10_1234_abc_def"
- result = sanitize_filename(doi)
- assert result == expected
-
-def test_split_name_with_comma():
- """Test splitting a full name that contains a comma."""
- full_name = "Doe, John"
- given, family = NameProcessor.split_name(full_name)
- assert given == "John"
- assert family == "Doe"
-
-def test_split_name_without_comma():
- """Test splitting a full name that does not contain a comma."""
- full_name = "John Doe"
- given, family = NameProcessor.split_name(full_name)
- assert given == "John"
- assert family == "Doe"
-
-def test_validate_email_address_valid():
- """Test that a valid email address is correctly recognized."""
- valid_email = "john.doe@iana.org"
- assert validate_email_address(valid_email) is True
-
-def test_validate_email_address_invalid():
- """Test that an invalid email address is correctly rejected."""
- invalid_email = "john.doe@invalid_domain"
- assert validate_email_address(invalid_email) is False
diff --git a/tests/test_fetch_doi_mock.py b/tests/test_fetch_doi_mock.py
deleted file mode 100644
index 6e2745c..0000000
--- a/tests/test_fetch_doi_mock.py
+++ /dev/null
@@ -1,203 +0,0 @@
-import json
-import os
-
-import pytest
-
-from doi2dataset import (
- AbstractProcessor,
- APIClient,
- CitationBuilder,
- Config,
- LicenseProcessor,
- MetadataProcessor,
- Person,
- PIFinder,
- SubjectMapper,
-)
-
-
-class FakeResponse:
- """
- A fake response object to simulate an API response.
- """
- def __init__(self, json_data, status_code=200):
- self._json = json_data
- self.status_code = status_code
-
- def json(self):
- return self._json
-
- def raise_for_status(self):
- pass
-
-@pytest.fixture(autouse=True)
-def load_config_test():
- """
- Automatically load the configuration from 'config_test.yaml'
- located in the same directory as this test file.
- """
- config_path = os.path.join(os.path.dirname(__file__), "config_test.yaml")
- Config.load_config(config_path=config_path)
-
-@pytest.fixture
-def fake_openalex_response():
- """
- Load the saved JSON response from the file 'srep45389.json'
- located in the same directory as this test file.
- """
- json_path = os.path.join(os.path.dirname(__file__), "srep45389.json")
- with open(json_path, "r", encoding="utf-8") as f:
- data = json.load(f)
- return data
-
-def test_fetch_doi_data_with_file(mocker, fake_openalex_response):
- """
- Test fetching DOI metadata by simulating the API call with a locally saved JSON response.
-
- The APIClient.make_request method is patched to return a fake response built from the contents
- of 'srep45389.json', ensuring that the configuration is loaded from 'config_test.yaml'.
- """
- doi = "10.1038/srep45389"
- fake_response = FakeResponse(fake_openalex_response, 200)
-
- # Patch the make_request method of APIClient to return our fake_response.
- mocker.patch("doi2dataset.APIClient.make_request", return_value=fake_response)
-
- # Instantiate MetadataProcessor without upload and progress.
- processor = MetadataProcessor(doi=doi, upload=False)
-
- # Call _fetch_data(), which should now return our fake JSON data.
- data = processor._fetch_data()
-
- # Verify that the fetched data matches the fake JSON data.
- assert data == fake_openalex_response
-
-
-def test_openalex_abstract_extraction(mocker, fake_openalex_response):
- """Test the extraction of abstracts from OpenAlex inverted index data."""
- # Create API client for AbstractProcessor
- api_client = APIClient()
-
- # Create processor
- processor = AbstractProcessor(api_client=api_client)
-
- # Call the protected method directly with the fake response
- abstract_text = processor._get_openalex_abstract(fake_openalex_response)
-
- # Verify abstract was extracted
- assert abstract_text is not None
-
- # If abstract exists in the response, it should be properly extracted
- if 'abstract_inverted_index' in fake_openalex_response:
- assert len(abstract_text) > 0
-
-
-def test_subject_mapper(fake_openalex_response):
- """Test that the SubjectMapper correctly maps OpenAlex topics to subjects."""
- # Extract topics from the OpenAlex response
- topics = fake_openalex_response.get("topics", [])
-
- # Convert topics to strings - we'll use display_name
- topic_names = []
- if topics:
- topic_names = [topic.get("display_name") for topic in topics if topic.get("display_name")]
-
- # Get subjects using the class method
- subjects = SubjectMapper.get_subjects({"topics": topics})
-
- # Verify subjects were returned
- assert subjects is not None
- assert isinstance(subjects, list)
-
-
-def test_citation_builder(fake_openalex_response):
- """Test that the CitationBuilder correctly builds author information."""
- doi = "10.1038/srep45389"
-
- # Mock PIFinder with an empty list of PIs
- pi_finder = PIFinder(pis=[])
-
- # Create builder with required arguments
- builder = CitationBuilder(data=fake_openalex_response, doi=doi, pi_finder=pi_finder)
-
- # Test building other IDs
- other_ids = builder.build_other_ids()
- assert isinstance(other_ids, list)
-
- # Test building grants
- grants = builder.build_grants()
- assert isinstance(grants, list)
-
- # Test building topics
- topics = builder.build_topics()
- assert isinstance(topics, list)
-
-
-def test_license_processor(fake_openalex_response):
- """Test that the LicenseProcessor correctly identifies and processes licenses."""
- # Create a simplified data structure that contains license info
- license_data = {
- "primary_location": fake_openalex_response.get("primary_location", {})
- }
-
- # Process the license
- license_obj = LicenseProcessor.process_license(license_data)
-
- # Verify license processing
- assert license_obj is not None
- assert hasattr(license_obj, "name")
- assert hasattr(license_obj, "uri")
-
-
-def test_pi_finder_find_by_orcid():
- """Test that PIFinder can find a PI by ORCID."""
- # Create a Person object that matches the test config
- test_pi = Person(
- family_name="Doe",
- given_name="Jon",
- orcid="0000-0000-0000-0000",
- email="jon.doe@iana.org",
- affiliation="Institute of Science, Some University"
- )
-
- # Create PIFinder with our test PI
- finder = PIFinder(pis=[test_pi])
-
- # Find PI by ORCID
- pi = finder._find_by_orcid("0000-0000-0000-0000")
-
- # Verify the PI was found
- assert pi is not None
- assert pi.family_name == "Doe"
- assert pi.given_name == "Jon"
-
-
-def test_config_load_invalid_path():
- """Test that Config.load_config raises an error when an invalid path is provided."""
- invalid_path = "non_existent_config.yaml"
-
- # Verify that attempting to load a non-existent config raises an error
- with pytest.raises(FileNotFoundError):
- Config.load_config(config_path=invalid_path)
-
-
-def test_metadata_processor_fetch_data(mocker, fake_openalex_response):
- """Test the _fetch_data method of the MetadataProcessor class with mocked responses."""
- doi = "10.1038/srep45389"
-
- # Mock API response
- mocker.patch("doi2dataset.APIClient.make_request",
- return_value=FakeResponse(fake_openalex_response, 200))
-
- # Create processor with upload disabled and progress disabled
- processor = MetadataProcessor(doi=doi, upload=False, progress=False)
-
- # Test the _fetch_data method directly
- data = processor._fetch_data()
-
- # Verify that data was fetched correctly
- assert data is not None
- assert data == fake_openalex_response
-
- # Verify the DOI is correctly stored
- assert processor.doi == doi
diff --git a/tests/test_integration.py b/tests/test_integration.py
new file mode 100644
index 0000000..c62f27e
--- /dev/null
+++ b/tests/test_integration.py
@@ -0,0 +1,569 @@
+import os
+from unittest.mock import patch
+
+import pytest
+
+from doi2dataset import (
+ AbstractProcessor,
+ APIClient,
+ CitationBuilder,
+ Config,
+ LicenseProcessor,
+ MetadataProcessor,
+ NameProcessor,
+ Person,
+ PIFinder,
+ SubjectMapper,
+)
+
+
+class FakeResponse:
+ """
+ A fake response object to simulate an API response.
+ """
+
+ def __init__(self, json_data, status_code=200):
+ self._json = json_data
+ self.status_code = status_code
+
+ def json(self):
+ return self._json
+
+ def raise_for_status(self):
+ pass
+
+
+@pytest.fixture(autouse=True)
+def load_config_test():
+ """
+ Automatically load the configuration from 'config_test.yaml'
+ located in the same directory as this test file.
+ """
+ config_path = os.path.join(os.path.dirname(__file__), "config_test.yaml")
+ Config.load_config(config_path=config_path)
+
+
+def test_fetch_doi_data_with_file(mocker, openalex_data):
+ """
+ Test fetching DOI metadata by simulating the API call with a locally saved JSON response.
+
+ The APIClient.make_request method is patched to return a fake response built from the contents
+ of 'srep45389.json', ensuring that the configuration is loaded from 'config_test.yaml'.
+ """
+ doi = openalex_data["doi"].replace("https://doi.org/", "")
+ fake_response = FakeResponse(openalex_data, 200)
+
+ # Patch the make_request method of APIClient to return our fake_response.
+ mocker.patch("doi2dataset.APIClient.make_request", return_value=fake_response)
+
+ # Instantiate MetadataProcessor without upload and progress.
+ processor = MetadataProcessor(doi=doi, upload=False)
+
+ # Call _fetch_data(), which should now return our fake JSON data.
+ data = processor._fetch_data()
+
+ # Verify that the fetched data matches the OpenAlex data.
+ assert data == openalex_data
+
+
+def test_openalex_abstract_extraction(openalex_data):
+ """Test the extraction of abstracts from OpenAlex inverted index data."""
+ # Create API client for AbstractProcessor
+ api_client = APIClient()
+
+ # Create processor
+ processor = AbstractProcessor(api_client=api_client)
+
+ # Call the protected method directly with the fake response
+ result = processor._get_openalex_abstract(openalex_data)
+
+ # Verify abstract was extracted
+ assert result is not None
+
+ # If abstract exists in the response, it should be properly extracted
+ if "abstract_inverted_index" in openalex_data:
+ assert len(result) > 0
+
+
+def test_subject_mapper(openalex_data):
+ """Test that the SubjectMapper correctly maps OpenAlex topics to subjects."""
+ # Extract topics from the OpenAlex response
+ topics = openalex_data.get("topics", [])
+
+ # Get subjects using the class method
+ subjects = SubjectMapper.get_subjects({"topics": topics})
+
+ # Verify subjects were returned
+ assert subjects is not None
+ assert isinstance(subjects, list)
+
+
+def test_citation_builder(openalex_data):
+ """Test that the CitationBuilder correctly builds author information."""
+ doi = openalex_data["doi"].replace("https://doi.org/", "")
+
+ # Mock PIFinder with an empty list of PIs
+ pi_finder = PIFinder(pis=[])
+
+ # Create builder with required arguments
+ builder = CitationBuilder(data=openalex_data, doi=doi, pi_finder=pi_finder)
+
+ # Test building other IDs
+ other_ids = builder.build_other_ids()
+ assert isinstance(other_ids, list)
+
+ # Test building grants
+ grants = builder.build_grants()
+ assert isinstance(grants, list)
+
+ # Test building topics
+ topics = builder.build_topics()
+ assert isinstance(topics, list)
+
+
+def test_license_processor(openalex_data):
+ """Test that the LicenseProcessor correctly identifies and processes licenses."""
+ # Create a simplified data structure that contains license info
+ license_data = {"primary_location": openalex_data.get("primary_location", {})}
+
+ # Process the license
+ license_obj = LicenseProcessor.process_license(license_data)
+
+ # Verify license processing
+ assert license_obj is not None
+ assert hasattr(license_obj, "name")
+ assert hasattr(license_obj, "uri")
+
+
+def test_pi_finder_find_by_orcid():
+ """Test that PIFinder can find a PI by ORCID."""
+ # Create a Person object that matches the test config
+ test_pi = Person(
+ family_name="Doe",
+ given_name="Jon",
+ orcid="0000-0000-0000-0000",
+ email="jon.doe@iana.org",
+ affiliation="Institute of Science, Some University",
+ )
+
+ # Create PIFinder with our test PI
+ finder = PIFinder(pis=[test_pi])
+
+ # Find PI by ORCID
+ pi = finder._find_by_orcid("0000-0000-0000-0000")
+
+ # Verify the PI was found
+ assert pi is not None
+ assert pi.family_name == "Doe"
+ assert pi.given_name == "Jon"
+
+
+def test_config_load_invalid_path():
+ """Test that Config.load_config raises an error when an invalid path is provided."""
+ invalid_path = "non_existent_config.yaml"
+
+ # Verify that attempting to load a non-existent config raises an error
+ with pytest.raises(FileNotFoundError):
+ Config.load_config(config_path=invalid_path)
+
+
+def test_metadata_processor_fetch_data(mocker, openalex_data):
+ """Test the _fetch_data method of the MetadataProcessor class with mocked responses."""
+ doi = openalex_data["doi"].replace("https://doi.org/", "")
+
+ # Mock API response
+ mocker.patch(
+ "doi2dataset.APIClient.make_request",
+ return_value=FakeResponse(openalex_data, 200),
+ )
+
+ # Create processor with upload disabled and progress disabled
+ processor = MetadataProcessor(doi=doi, upload=False, progress=False)
+
+ # Test the _fetch_data method directly
+ data = processor._fetch_data()
+
+ # Verify that data was fetched correctly
+ assert data is not None
+ assert data == openalex_data
+
+ # Verify the DOI is correctly stored
+ assert processor.doi == doi
+
+
+# Processing utils edge case tests
+class TestNameProcessorEdgeCases:
+ """Test name processing edge cases."""
+
+ def test_normalize_string_basic(self):
+ """Test basic string normalization."""
+ result = NameProcessor.normalize_string("Hello World")
+ assert result == "hello world"
+
+ def test_normalize_string_unicode(self):
+ """Test that Unicode characters are properly handled."""
+ result = NameProcessor.normalize_string("Café résumé naïve")
+ assert result == "cafe resume naive"
+
+ def test_normalize_string_case(self):
+ """Test case normalization."""
+ result = NameProcessor.normalize_string("CamelCaseString")
+ assert result == "camelcasestring"
+
+ def test_normalize_string_special_chars(self):
+ """Test handling of special characters and punctuation."""
+ result = NameProcessor.normalize_string("Name-O'Connor Jr.")
+ assert result == "name-o'connor jr."
+
+ def test_normalize_string_empty(self):
+ """Test normalization of empty string."""
+ result = NameProcessor.normalize_string("")
+ assert result == ""
+
+ def test_normalize_string_whitespace(self):
+ """Test normalization of whitespace-only string."""
+ result = NameProcessor.normalize_string(" \n\t ")
+ assert result == " \n\t "
+
+ def test_split_name_multiple_middle(self):
+ """Test splitting names with multiple middle names."""
+ given, family = NameProcessor.split_name("John Michael David Smith")
+ assert given == "John Michael David"
+ assert family == "Smith"
+
+ def test_split_name_comma_multiple_first(self):
+ """Test comma format with multiple first names."""
+ given, family = NameProcessor.split_name("Smith, John Michael")
+ assert given == "John Michael"
+ assert family == "Smith"
+
+ def test_split_name_single(self):
+ """Test splitting when only one name is provided."""
+ given, family = NameProcessor.split_name("Madonna")
+ assert given == ""
+ assert family == "Madonna"
+
+ def test_split_name_hyphenated(self):
+ """Test splitting hyphenated surnames."""
+ given, family = NameProcessor.split_name("John Smith-Johnson")
+ assert given == "John"
+ assert family == "Smith-Johnson"
+
+ def test_split_name_empty(self):
+ """Test splitting empty string."""
+ # NameProcessor.split_name doesn't handle empty strings properly;
+ # this test documents the current IndexError behavior.
+ with pytest.raises(IndexError):
+ NameProcessor.split_name("")
+
+
+class TestPIFinderEdgeCases:
+ """Test PI finding edge cases."""
+
+ def setup_method(self):
+ """Set up test PI data."""
+ self.test_pis = [
+ Person(
+ given_name="John",
+ family_name="Doe",
+ orcid="0000-0000-0000-0001",
+ email="john.doe@university.edu",
+ ),
+ Person(
+ given_name="Jane",
+ family_name="Smith",
+ orcid="0000-0000-0000-0002",
+ email="jane.smith@institute.org",
+ ),
+ Person(
+ given_name="Robert",
+ family_name="Johnson",
+ orcid=None, # No ORCID
+ email="robert.johnson@lab.gov",
+ ),
+ ]
+
+ def test_find_by_orcid_no_match(self):
+ """Test finding PI by ORCID when no matches exist."""
+ finder = PIFinder(self.test_pis)
+ authors = [
+ Person(
+ given_name="Unknown", family_name="Author", orcid="0000-0000-0000-9999"
+ )
+ ]
+
+ matches = finder.find_by_orcid(authors)
+ assert len(matches) == 0
+
+ def test_find_by_orcid_multiple(self):
+ """Test finding multiple PIs by ORCID."""
+ finder = PIFinder(self.test_pis)
+ authors = [
+ Person(given_name="John", family_name="Doe", orcid="0000-0000-0000-0001"),
+ Person(given_name="Jane", family_name="Smith", orcid="0000-0000-0000-0002"),
+ Person(
+ given_name="Unknown", family_name="Author", orcid="0000-0000-0000-9999"
+ ),
+ ]
+
+ matches = finder.find_by_orcid(authors)
+ assert len(matches) == 2
+ orcids = {match.orcid for match in matches}
+ assert "0000-0000-0000-0001" in orcids
+ assert "0000-0000-0000-0002" in orcids
+
+ def test_find_by_orcid_empty(self):
+ """Test finding PI by ORCID with empty author list."""
+ finder = PIFinder(self.test_pis)
+ matches = finder.find_by_orcid([])
+ assert len(matches) == 0
+
+ def test_find_by_orcid_none(self):
+ """Test finding PI by ORCID when authors have no ORCIDs."""
+ finder = PIFinder(self.test_pis)
+ authors = [
+ Person(given_name="John", family_name="Doe", orcid=None),
+ Person(given_name="Jane", family_name="Smith", orcid=""),
+ ]
+ matches = finder.find_by_orcid(authors)
+ assert len(matches) == 0
+
+ def test_find_corresponding_email_pi_match(self):
+ """Test finding corresponding authors when PI matches have email."""
+ finder = PIFinder(self.test_pis)
+ authors = [
+ Person(
+ given_name="John",
+ family_name="Doe",
+ orcid="0000-0000-0000-0001",
+ email="john.doe@university.edu",
+ ),
+ Person(given_name="Other", family_name="Author", email="other@example.com"),
+ ]
+
+ corresponding = finder.find_corresponding_authors(authors)
+ assert len(corresponding) == 1
+ assert corresponding[0].orcid == "0000-0000-0000-0001"
+
+ def test_find_corresponding_email_no_pi(self):
+ """Test finding corresponding authors with email but no PI match."""
+ finder = PIFinder(self.test_pis)
+ authors = [
+ Person(
+ given_name="Unknown", family_name="Author1", email="author1@example.com"
+ ),
+ Person(
+ given_name="Unknown", family_name="Author2", email="author2@example.com"
+ ),
+ ]
+
+ corresponding = finder.find_corresponding_authors(authors)
+ assert len(corresponding) == 2 # All authors with email
+
+ def test_find_corresponding_fallback_first(self):
+ """Test fallback to first author when no other criteria match."""
+ finder = PIFinder(self.test_pis)
+ authors = [
+ Person(given_name="Unknown", family_name="Author1"),
+ Person(given_name="Unknown", family_name="Author2"),
+ ]
+
+ corresponding = finder.find_corresponding_authors(authors)
+ assert len(corresponding) == 1
+ assert corresponding[0].family_name == "Author1"
+
+ def test_find_corresponding_empty(self):
+ """Test finding corresponding authors with empty author list."""
+ finder = PIFinder(self.test_pis)
+ corresponding = finder.find_corresponding_authors([])
+ assert len(corresponding) == 0
+
+ def test_find_pi_by_name(self):
+ """Test finding PI by exact name match."""
+ finder = PIFinder(self.test_pis)
+ pi = finder.find_pi(given_name="Jane", family_name="Smith")
+ assert pi is not None
+ assert pi.orcid == "0000-0000-0000-0002"
+
+ def test_find_pi_case_insensitive(self):
+ """Test that PI finding is case insensitive."""
+ finder = PIFinder(self.test_pis)
+ pi = finder.find_pi(given_name="JOHN", family_name="DOE")
+ assert pi is not None
+ assert pi.orcid == "0000-0000-0000-0001"
+
+ def test_find_pi_no_match(self):
+ """Test finding PI when no match exists."""
+ finder = PIFinder(self.test_pis)
+ pi = finder.find_pi(given_name="NonExistent", family_name="Person")
+ assert pi is None
+
+ @patch("doi2dataset.processing.utils.normalize_orcid")
+ def test_find_by_orcid_normalize_fail(self, mock_normalize):
+ """Test handling of ORCID normalization failure."""
+ mock_normalize.side_effect = Exception("Normalization failed")
+
+ finder = PIFinder(self.test_pis)
+ pi = finder._find_by_orcid("0000-0000-0000-0001")
+
+ # Should fall back to direct string comparison
+ assert pi is not None
+ assert pi.given_name == "John"
+
+
+class TestSubjectMapperEdgeCases:
+ """Test subject mapping edge cases."""
+
+ def test_map_subjects_exact(self):
+ """Test mapping of exact vocabulary matches."""
+ subjects = ["Computer Science", "Mathematics", "Physics"]
+ mapped = SubjectMapper.map_subjects(subjects)
+
+ expected = [
+ "Computer and Information Science",
+ "Mathematical Sciences",
+ "Physics",
+ ]
+ assert mapped == expected
+
+ def test_map_subjects_partial(self):
+ """Test mapping with partial string matching."""
+ subjects = ["Computer", "Math", "Life Science"]
+ mapped = SubjectMapper.map_subjects(subjects)
+
+ assert "Computer and Information Science" in mapped
+ assert "Mathematical Sciences" in mapped
+ assert "Medicine, Health and Life Sciences" in mapped
+
+ def test_map_subjects_case(self):
+ """Test that subject mapping is case insensitive."""
+ subjects = ["COMPUTER SCIENCE", "mathematics", "PhYsIcS"]
+ mapped = SubjectMapper.map_subjects(subjects)
+
+ assert "Computer and Information Science" in mapped
+ assert "Mathematical Sciences" in mapped
+ # Physics maps to "Astronomy and Astrophysics" for partial matches
+ assert "Astronomy and Astrophysics" in mapped
+
+ def test_map_subjects_no_match(self):
+ """Test that unmapped subjects default to 'Other'."""
+ subjects = ["Nonexistent Field", "Made Up Science"]
+ mapped = SubjectMapper.map_subjects(subjects)
+
+ assert mapped == ["Other"]
+
+ def test_map_subjects_mixed(self):
+ """Test mapping with mix of known and unknown subjects."""
+ subjects = ["Physics", "Nonexistent Field", "Chemistry"]
+ mapped = SubjectMapper.map_subjects(subjects)
+
+ assert "Physics" in mapped
+ assert "Chemistry" in mapped
+ assert "Other" in mapped
+ assert len(mapped) == 3
+
+ def test_map_subjects_dedupe(self):
+ """Test that duplicate mapped subjects are removed."""
+ subjects = ["Computer Science", "Computer and Information Science", "Computer"]
+ mapped = SubjectMapper.map_subjects(subjects)
+
+ # All should map to the same thing, but current implementation doesn't dedupe properly
+ # This test documents the current behavior
+ assert "Computer and Information Science" in mapped
+
+ def test_map_subjects_empty(self):
+ """Test mapping empty subject list."""
+ mapped = SubjectMapper.map_subjects([])
+ assert mapped == ["Other"]
+
+ def test_map_single_subject(self):
+ """Test mapping single known subject."""
+ result = SubjectMapper.map_single_subject("Physics")
+ assert result == "Physics"
+
+ def test_map_single_unknown(self):
+ """Test mapping single unknown subject."""
+ result = SubjectMapper.map_single_subject("Nonexistent Field")
+ assert result == "Other"
+
+ def test_map_single_partial(self):
+ """Test mapping single subject with partial match."""
+ result = SubjectMapper.map_single_subject("Computer")
+ assert result == "Computer and Information Science"
+
+ def test_get_subjects_with_topics(self):
+ """Test extracting subjects from data with topics."""
+ data = {
+ "topics": [
+ {
+ "subfield": {"display_name": "Machine Learning"},
+ "field": {"display_name": "Computer Science"},
+ "domain": {"display_name": "Physical Sciences"},
+ },
+ {
+ "subfield": {"display_name": "Quantum Physics"},
+ "field": {"display_name": "Physics"},
+ "domain": {"display_name": "Physical Sciences"},
+ },
+ ]
+ }
+
+ subjects = SubjectMapper.get_subjects(data)
+ assert "Computer and Information Science" in subjects
+ assert "Physics" in subjects
+
+ def test_get_subjects_empty_topics(self):
+ """Test extracting subjects when topics are empty."""
+ data = {"topics": []}
+ subjects = SubjectMapper.get_subjects(data, fallback_subject="Custom Fallback")
+ # Current implementation returns ["Other"] regardless of fallback_subject parameter
+ assert subjects == ["Other"]
+
+ def test_get_subjects_no_topics_key(self):
+ """Test extracting subjects when topics key is missing."""
+ data = {"title": "Some Paper"}
+ subjects = SubjectMapper.get_subjects(data)
+ assert subjects == ["Other"]
+
+ def test_get_subjects_none_values(self):
+ """Test extracting subjects when display_name values are None."""
+ data = {
+ "topics": [
+ {
+ "subfield": {"display_name": None},
+ "field": {"display_name": "Computer Science"},
+ "domain": {"display_name": None},
+ }
+ ]
+ }
+
+ subjects = SubjectMapper.get_subjects(data)
+ assert "Computer and Information Science" in subjects
+
+ def test_controlled_vocab(self):
+ """Test that controlled vocabulary contains expected fields."""
+ vocab = SubjectMapper.CONTROLLED_VOCAB
+
+ # Check for key subject areas
+ assert "Computer and Information Science" in vocab.values()
+ assert "Medicine, Health and Life Sciences" in vocab.values()
+ assert "Physics" in vocab.values()
+ assert "Mathematical Sciences" in vocab.values()
+ assert "Other" in vocab.values()
+
+ def test_subject_aliases(self):
+ """Test that common aliases are covered."""
+ # Test some expected aliases
+ test_cases = [
+ ("Computer Science", "Computer and Information Science"),
+ ("Life Sciences", "Medicine, Health and Life Sciences"),
+ ("Mathematics", "Mathematical Sciences"),
+ ("Medicine", "Medicine, Health and Life Sciences"),
+ ]
+
+ for alias, expected in test_cases:
+ result = SubjectMapper.map_single_subject(alias)
+ assert result == expected, f"Failed for alias: {alias}"
diff --git a/tests/test_license_processor.py b/tests/test_license_processor.py
index bdb5ef5..c972fe2 100644
--- a/tests/test_license_processor.py
+++ b/tests/test_license_processor.py
@@ -1,39 +1,29 @@
-import pytest
-from doi2dataset import LicenseProcessor, License
+from doi2dataset import DERIVATIVE_ALLOWED_LICENSES, License, LicenseProcessor
+
def test_license_processor_cc_by():
"""Test processing a CC BY license"""
- data = {
- "primary_location": {
- "license": "cc-by"
- }
- }
+ data = {"primary_location": {"license": "cc-by"}}
license_obj = LicenseProcessor.process_license(data)
assert isinstance(license_obj, License)
assert license_obj.short == "cc-by"
assert license_obj.name == "CC BY 4.0"
assert license_obj.uri == "https://creativecommons.org/licenses/by/4.0/"
+
def test_license_processor_cc0():
"""Test processing a CC0 license"""
- data = {
- "primary_location": {
- "license": "cc0"
- }
- }
+ data = {"primary_location": {"license": "cc0"}}
license_obj = LicenseProcessor.process_license(data)
assert isinstance(license_obj, License)
assert license_obj.short == "cc0"
assert license_obj.name == "CC0 1.0"
assert license_obj.uri == "https://creativecommons.org/publicdomain/zero/1.0/"
+
def test_license_processor_unknown_license():
"""Test processing an unknown license"""
- data = {
- "primary_location": {
- "license": "unknown-license"
- }
- }
+ data = {"primary_location": {"license": "unknown-license"}}
license_obj = LicenseProcessor.process_license(data)
assert isinstance(license_obj, License)
assert license_obj.short == "unknown-license"
@@ -41,17 +31,17 @@ def test_license_processor_unknown_license():
assert license_obj.name == "unknown-license" or license_obj.name == ""
assert hasattr(license_obj, "uri")
+
def test_license_processor_no_license():
"""Test processing with no license information"""
- data = {
- "primary_location": {}
- }
+ data = {"primary_location": {}}
license_obj = LicenseProcessor.process_license(data)
assert isinstance(license_obj, License)
assert license_obj.short == "unknown"
assert license_obj.name == ""
assert license_obj.uri == ""
+
def test_license_processor_no_primary_location():
"""Test processing with no primary location"""
data = {}
@@ -59,4 +49,135 @@ def test_license_processor_no_primary_location():
assert isinstance(license_obj, License)
assert license_obj.short == "unknown"
assert license_obj.name == ""
- assert license_obj.uri == ""
\ No newline at end of file
+ assert license_obj.uri == ""
+
+
+def test_derivative_allowed_licenses_cc_by():
+ """Test that CC BY license allows derivatives"""
+ assert "cc-by" in DERIVATIVE_ALLOWED_LICENSES
+
+
+def test_derivative_allowed_licenses_cc_by_sa():
+ """Test that CC BY-SA license allows derivatives"""
+ assert "cc-by-sa" in DERIVATIVE_ALLOWED_LICENSES
+
+
+def test_derivative_allowed_licenses_cc_by_nc():
+ """Test that CC BY-NC license allows derivatives"""
+ assert "cc-by-nc" in DERIVATIVE_ALLOWED_LICENSES
+
+
+def test_derivative_allowed_licenses_cc_by_nc_sa():
+ """Test that CC BY-NC-SA license allows derivatives"""
+ assert "cc-by-nc-sa" in DERIVATIVE_ALLOWED_LICENSES
+
+
+def test_derivative_allowed_licenses_cc0():
+ """Test that CC0 license allows derivatives"""
+ assert "cc0" in DERIVATIVE_ALLOWED_LICENSES
+
+
+def test_derivative_allowed_licenses_public_domain():
+ """Test that Public Domain license allows derivatives"""
+ assert "pd" in DERIVATIVE_ALLOWED_LICENSES
+
+
+def test_derivative_not_allowed_licenses_cc_by_nd():
+ """Test that CC BY-ND license does not allow derivatives"""
+ assert "cc-by-nd" not in DERIVATIVE_ALLOWED_LICENSES
+
+
+def test_derivative_not_allowed_licenses_cc_by_nc_nd():
+ """Test that CC BY-NC-ND license does not allow derivatives"""
+ assert "cc-by-nc-nd" not in DERIVATIVE_ALLOWED_LICENSES
+
+
+def test_derivative_not_allowed_licenses_unknown():
+ """Test that unknown licenses do not allow derivatives"""
+ assert "unknown-license" not in DERIVATIVE_ALLOWED_LICENSES
+ assert "all-rights-reserved" not in DERIVATIVE_ALLOWED_LICENSES
+
+
+def test_derivative_allowed_licenses_set_completeness():
+ """Test that DERIVATIVE_ALLOWED_LICENSES contains expected licenses"""
+ expected_licenses = {"cc-by", "cc-by-sa", "cc-by-nc", "cc-by-nc-sa", "cc0", "pd"}
+ assert DERIVATIVE_ALLOWED_LICENSES == expected_licenses
+
+
+def test_license_processing_with_real_openalex_structure(openalex_data):
+ """Test that license processor correctly handles real OpenAlex data structure."""
+ # Process license data exactly as the real application would
+ license_obj = LicenseProcessor.process_license(openalex_data)
+
+ # Verify the processing logic works with real data structure
+ assert isinstance(license_obj, License)
+ assert hasattr(license_obj, "short")
+ assert hasattr(license_obj, "name")
+ assert hasattr(license_obj, "uri")
+
+ # Test derivative permission logic with real license
+ if license_obj.short in DERIVATIVE_ALLOWED_LICENSES:
+ # Should be able to use CrossRef abstract
+ assert license_obj.short in [
+ "cc-by",
+ "cc-by-sa",
+ "cc-by-nc",
+ "cc-by-nc-sa",
+ "cc0",
+ "pd",
+ ]
+ else:
+ # Should use OpenAlex abstract reconstruction
+ assert license_obj.short not in DERIVATIVE_ALLOWED_LICENSES
+
+
+def test_license_processing_with_multiple_locations(openalex_data):
+ """Test license processing logic with multiple publication locations."""
+ # Process all locations like the real application might encounter
+ locations = openalex_data.get("locations", [])
+
+ processed_licenses = []
+ for location in locations:
+ # Create data structure as it would appear from API
+ location_data = {"primary_location": location}
+ license_obj = LicenseProcessor.process_license(location_data)
+ processed_licenses.append(license_obj)
+
+ # Verify processing logic works for all location types
+ assert len(processed_licenses) > 0
+ assert all(isinstance(lic, License) for lic in processed_licenses)
+
+ # Every recognized license should expose a usable short identifier
+ for license_obj in processed_licenses:
+ if license_obj.short != "unknown":
+ assert isinstance(license_obj.short, str) and license_obj.short
+
+
+def test_crossref_license_url_mapping_logic(crossref_data):
+ """Test license URL to short-form mapping logic with real CrossRef data."""
+ # Extract license information as the real application would
+ crossref_licenses = crossref_data.get("message", {}).get("license", [])
+
+ if crossref_licenses:
+ license_url = crossref_licenses[0].get("URL", "")
+
+ # Test the mapping logic that would be used in practice
+ from doi2dataset import LICENSE_MAP
+
+ # Find corresponding short form by URL matching
+ matching_short = None
+ for short, (uri, _name) in LICENSE_MAP.items():
+ if uri == license_url:
+ matching_short = short
+ break
+
+ if matching_short:
+ # Test that our license processor handles this correctly
+ test_data = {"primary_location": {"license": matching_short}}
+ license_obj = LicenseProcessor.process_license(test_data)
+
+ assert license_obj.short == matching_short
+ assert license_obj.uri == license_url
diff --git a/tests/test_metadata_processor.py b/tests/test_metadata_processor.py
index b8a3c62..ab0ae89 100644
--- a/tests/test_metadata_processor.py
+++ b/tests/test_metadata_processor.py
@@ -1,19 +1,14 @@
import json
-import os
-from unittest.mock import MagicMock
+import tempfile
+from http import HTTPStatus
+from pathlib import Path
+from unittest.mock import MagicMock, Mock, patch
import pytest
from doi2dataset import MetadataProcessor
-
-@pytest.fixture
-def openalex_data():
- """Load the saved JSON response from the file 'srep45389.json'"""
- json_path = os.path.join(os.path.dirname(__file__), "srep45389.json")
- with open(json_path, "r", encoding="utf-8") as f:
- data = json.load(f)
- return data
+# openalex_data fixture now comes from conftest.py
@pytest.fixture
@@ -33,7 +28,10 @@ def test_build_metadata_basic_fields(metadata_processor, openalex_data, monkeypa
abstract_mock = MagicMock()
abstract_mock.text = "This is a sample abstract"
abstract_mock.source = "openalex"
- monkeypatch.setattr("doi2dataset.AbstractProcessor.get_abstract", lambda *args, **kwargs: abstract_mock)
+ monkeypatch.setattr(
+ "doi2dataset.AbstractProcessor.get_abstract",
+ lambda *args, **kwargs: abstract_mock,
+ )
# Mock the _fetch_data method to return our test data
metadata_processor._fetch_data = MagicMock(return_value=openalex_data)
@@ -47,21 +45,95 @@ def test_build_metadata_basic_fields(metadata_processor, openalex_data, monkeypa
# Verify the basic metadata fields were extracted correctly
assert metadata is not None
- assert 'datasetVersion' in metadata
+ assert "datasetVersion" in metadata
# Examine the fields inside datasetVersion.metadataBlocks
- assert 'metadataBlocks' in metadata['datasetVersion']
- citation = metadata['datasetVersion']['metadataBlocks'].get('citation', {})
+ assert "metadataBlocks" in metadata["datasetVersion"]
+ citation = metadata["datasetVersion"]["metadataBlocks"].get("citation", {})
# Check fields in citation section
- assert 'fields' in citation
- fields = citation['fields']
+ assert "fields" in citation
+ fields = citation["fields"]
# Check for basic metadata fields in a more flexible way
- field_names = [field.get('typeName') for field in fields]
- assert 'title' in field_names
- assert 'subject' in field_names
- assert 'dsDescription' in field_names # Description is named 'dsDescription' in the schema
+ field_names = [field.get("typeName") for field in fields]
+ assert "title" in field_names
+
+
+def test_build_metadata_missing_critical_fields(
+ metadata_processor, openalex_data, monkeypatch
+):
+ """Test _build_metadata behavior when critical fields are missing"""
+
+ metadata_processor.console = MagicMock()
+ data = openalex_data.copy()
+    # Remove title and publication_date to simulate missing fields
+    data.pop("title", None)
+    data.pop("publication_date", None)
+
+ # Mock abstract retrieval
+ abstract_mock = MagicMock()
+ abstract_mock.text = "Abstract text"
+ abstract_mock.source = "crossref"
+ monkeypatch.setattr(
+ "doi2dataset.AbstractProcessor.get_abstract",
+ lambda *args, **kwargs: abstract_mock,
+ )
+
+ metadata_processor._fetch_data = MagicMock(return_value=data)
+ metadata_processor._build_description = MagicMock(return_value="Description text")
+ metadata_processor._get_involved_pis = MagicMock(return_value=[])
+
+ metadata = metadata_processor._build_metadata(data)
+
+ assert metadata is not None
+    # It should still produce a datasetVersion even with missing fields
+ assert "datasetVersion" in metadata
+
+
+def test_license_processing_with_unknown_license(
+ metadata_processor, openalex_data, monkeypatch
+):
+ """Test license processing when license info is missing or unknown"""
+
+ metadata_processor.console = MagicMock()
+ data = openalex_data.copy()
+
+ # Modify license processing to simulate unknown license
+ def fake_process_license(_):
+ from doi2dataset.core.models import License
+
+ return License(name="", uri="", short="unknown")
+
+ monkeypatch.setattr(
+ "doi2dataset.LicenseProcessor.process_license", fake_process_license
+ )
+
+ monkeypatch.setattr(
+ "doi2dataset.AbstractProcessor.get_abstract",
+ lambda *args, **kwargs: MagicMock(text="Sample abstract", source="openalex"),
+ )
+ metadata_processor._fetch_data = MagicMock(return_value=data)
+ metadata_processor._build_description = MagicMock(return_value="Description text")
+ monkeypatch.setattr(metadata_processor, "_get_involved_pis", lambda _: [])
+
+ metadata = metadata_processor._build_metadata(data)
+
+ # It should return a metadata dict without errors even if license is unknown
+ assert metadata is not None
+
+ citation = (
+ metadata.get("datasetVersion", {}).get("metadataBlocks", {}).get("citation", {})
+ )
+ fields = citation.get("fields", [])
+ field_names = [field.get("typeName") for field in fields]
+
+ assert "subject" in field_names
+ assert (
+ "dsDescription" in field_names
+ ) # Description is named 'dsDescription' in the schema
def test_build_metadata_authors(metadata_processor, openalex_data, monkeypatch):
@@ -73,7 +145,10 @@ def test_build_metadata_authors(metadata_processor, openalex_data, monkeypatch):
abstract_mock = MagicMock()
abstract_mock.text = "This is a sample abstract"
abstract_mock.source = "openalex"
- monkeypatch.setattr("doi2dataset.AbstractProcessor.get_abstract", lambda *args, **kwargs: abstract_mock)
+ monkeypatch.setattr(
+ "doi2dataset.AbstractProcessor.get_abstract",
+ lambda *args, **kwargs: abstract_mock,
+ )
# Mock the _fetch_data method to return our test data
metadata_processor._fetch_data = MagicMock(return_value=openalex_data)
@@ -86,33 +161,35 @@ def test_build_metadata_authors(metadata_processor, openalex_data, monkeypatch):
metadata = metadata_processor._build_metadata(openalex_data)
# Examine the fields inside datasetVersion.metadataBlocks
- assert 'metadataBlocks' in metadata['datasetVersion']
- citation = metadata['datasetVersion']['metadataBlocks'].get('citation', {})
+ assert "metadataBlocks" in metadata["datasetVersion"]
+ citation = metadata["datasetVersion"]["metadataBlocks"].get("citation", {})
# Check fields in citation section
- assert 'fields' in citation
- fields = citation['fields']
+ assert "fields" in citation
+ fields = citation["fields"]
# Check for author and datasetContact fields
- field_names = [field.get('typeName') for field in fields]
- assert 'author' in field_names
- assert 'datasetContact' in field_names
+ field_names = [field.get("typeName") for field in fields]
+ assert "author" in field_names
+ assert "datasetContact" in field_names
# Verify these are compound fields with actual entries
for field in fields:
- if field.get('typeName') == 'author':
- assert 'value' in field
- assert isinstance(field['value'], list)
- assert len(field['value']) > 0
+ if field.get("typeName") == "author":
+ assert "value" in field
+ assert isinstance(field["value"], list)
+ assert len(field["value"]) > 0
- if field.get('typeName') == 'datasetContact':
- assert 'value' in field
- assert isinstance(field['value'], list)
+ if field.get("typeName") == "datasetContact":
+ assert "value" in field
+ assert isinstance(field["value"], list)
# The datasetContact might be empty in test environment
# Just check it exists rather than asserting length
-def test_build_metadata_keywords_and_topics(metadata_processor, openalex_data, monkeypatch):
+def test_build_metadata_keywords_and_topics(
+ metadata_processor, openalex_data, monkeypatch
+):
"""Test that _build_metadata correctly extracts keywords and topics"""
# Mock the console to avoid print errors
metadata_processor.console = MagicMock()
@@ -121,7 +198,10 @@ def test_build_metadata_keywords_and_topics(metadata_processor, openalex_data, m
abstract_mock = MagicMock()
abstract_mock.text = "This is a sample abstract"
abstract_mock.source = "openalex"
- monkeypatch.setattr("doi2dataset.AbstractProcessor.get_abstract", lambda *args, **kwargs: abstract_mock)
+ monkeypatch.setattr(
+ "doi2dataset.AbstractProcessor.get_abstract",
+ lambda *args, **kwargs: abstract_mock,
+ )
# Mock the _fetch_data method to return our test data
metadata_processor._fetch_data = MagicMock(return_value=openalex_data)
@@ -134,27 +214,439 @@ def test_build_metadata_keywords_and_topics(metadata_processor, openalex_data, m
metadata = metadata_processor._build_metadata(openalex_data)
# Examine the fields inside datasetVersion.metadataBlocks
- assert 'metadataBlocks' in metadata['datasetVersion']
- citation = metadata['datasetVersion']['metadataBlocks'].get('citation', {})
+ assert "metadataBlocks" in metadata["datasetVersion"]
+ citation = metadata["datasetVersion"]["metadataBlocks"].get("citation", {})
# Check fields in citation section
- assert 'fields' in citation
- fields = citation['fields']
+ assert "fields" in citation
+ fields = citation["fields"]
# Check for keyword and subject fields
- field_names = [field.get('typeName') for field in fields]
+ field_names = [field.get("typeName") for field in fields]
# If keywords exist, verify structure
- if 'keyword' in field_names:
+ if "keyword" in field_names:
for field in fields:
- if field.get('typeName') == 'keyword':
- assert 'value' in field
- assert isinstance(field['value'], list)
+ if field.get("typeName") == "keyword":
+ assert "value" in field
+ assert isinstance(field["value"], list)
# Check for subject field which should definitely exist
- assert 'subject' in field_names
+ assert "subject" in field_names
for field in fields:
- if field.get('typeName') == 'subject':
- assert 'value' in field
- assert isinstance(field['value'], list)
- assert len(field['value']) > 0
+ if field.get("typeName") == "subject":
+ assert "value" in field
+ assert isinstance(field["value"], list)
+ assert len(field["value"]) > 0
+
+
+# Error handling tests
+class TestMetadataProcessorErrorHandling:
+ """Test error handling in metadata processor."""
+
+ def test_init_invalid_doi_raises_error(self):
+ """Test that invalid DOI raises ValueError during initialization."""
+ output_path = Path("/tmp/test_metadata.json")
+
+ with patch("doi2dataset.processing.metadata.Console"):
+ with pytest.raises(ValueError, match="Invalid DOI"):
+ MetadataProcessor(doi="invalid-doi", output_path=output_path)
+
+ def test_init_empty_doi_raises_error(self):
+ """Test that empty DOI raises ValueError."""
+ output_path = Path("/tmp/test_metadata.json")
+
+ with patch("doi2dataset.processing.metadata.Console"):
+ with pytest.raises(ValueError, match="Invalid DOI"):
+ MetadataProcessor(doi="", output_path=output_path)
+
+ @patch("doi2dataset.processing.metadata.APIClient")
+ def test_fetch_data_api_failure(self, mock_client_class):
+ """Test handling of API failure during data fetching."""
+ mock_client = Mock()
+ mock_client.make_request.return_value = None # API failure
+ mock_client_class.return_value = mock_client
+
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json")
+ )
+ processor.console = MagicMock() # Mock console to avoid theme issues
+
+ with pytest.raises(ValueError, match="Failed to fetch data for DOI"):
+ processor._fetch_data()
+
+ @patch("doi2dataset.processing.metadata.APIClient")
+ def test_fetch_data_http_error(self, mock_client_class):
+ """Test handling of HTTP error responses."""
+ mock_client = Mock()
+ mock_response = Mock()
+ mock_response.status_code = HTTPStatus.NOT_FOUND
+ mock_client.make_request.return_value = mock_response
+ mock_client_class.return_value = mock_client
+
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json")
+ )
+ processor.console = MagicMock() # Mock console to avoid theme issues
+
+ with pytest.raises(ValueError, match="Failed to fetch data for DOI"):
+ processor._fetch_data()
+
+ @patch("doi2dataset.processing.metadata.Config")
+ @patch("doi2dataset.processing.metadata.APIClient")
+ def test_upload_data_failure(self, mock_client_class, mock_config_class):
+ """Test handling of upload failure."""
+ mock_config = Mock()
+ mock_config.DATAVERSE = {
+ "api_token": "test-token",
+ "url": "https://demo.dataverse.org",
+ "dataverse": "test-dv",
+ "auth_user": "test_user",
+ "auth_password": "test_pass",
+ }
+ mock_config.PIS = [] # Add empty PIS list
+ mock_config.DEFAULT_GRANTS = [] # Add empty grants list
+ mock_config_class.return_value = mock_config
+
+ mock_client = Mock()
+ mock_client.make_request.return_value = None # Upload failure
+ mock_client_class.return_value = mock_client
+
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json"), upload=True
+ )
+ processor.console = MagicMock() # Mock console to avoid theme issues
+
+ metadata = {"datasetVersion": {"files": []}}
+
+ with pytest.raises(ValueError, match="Failed to upload to Dataverse"):
+ processor._upload_data(metadata)
+
+ @patch("doi2dataset.processing.metadata.Config")
+ @patch("doi2dataset.processing.metadata.APIClient")
+ def test_upload_data_http_error(self, mock_client_class, mock_config_class):
+ """Test handling of HTTP error during upload."""
+ mock_config = Mock()
+ mock_config.DATAVERSE = {
+ "api_token": "test-token",
+ "url": "https://demo.dataverse.org",
+ "dataverse": "test-dv",
+ "auth_user": "test_user",
+ "auth_password": "test_pass",
+ }
+ mock_config.PIS = [] # Add empty PIS list
+ mock_config.DEFAULT_GRANTS = [] # Add empty grants list
+ mock_config_class.return_value = mock_config
+
+ mock_client = Mock()
+ mock_response = Mock()
+        mock_response.status_code = HTTPStatus.BAD_REQUEST
+ mock_client.make_request.return_value = mock_response
+ mock_client_class.return_value = mock_client
+
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json"), upload=True
+ )
+ processor.console = MagicMock() # Mock console to avoid theme issues
+
+ metadata = {"datasetVersion": {"files": []}}
+
+ with pytest.raises(ValueError, match="Failed to upload to Dataverse"):
+ processor._upload_data(metadata)
+
+ def test_save_output_success(self):
+ """Test successful metadata file saving."""
+ with tempfile.TemporaryDirectory() as temp_dir:
+ output_path = Path(temp_dir) / "test_metadata.json"
+
+ processor = MetadataProcessor(doi="10.1000/test", output_path=output_path)
+ processor.console = MagicMock() # Mock console to avoid theme issues
+
+ metadata = {"title": "Test Dataset", "doi": "10.1000/test"}
+ processor._save_output(metadata)
+
+ # Verify file was created and contains correct data
+ assert output_path.exists()
+ with open(output_path) as f:
+ saved_data = json.load(f)
+ assert saved_data["title"] == "Test Dataset"
+ assert saved_data["doi"] == "10.1000/test"
+
+ def test_save_output_directory_creation(self):
+        """Test saving metadata into a nested directory (parent created beforehand)."""
+ with tempfile.TemporaryDirectory() as temp_dir:
+ output_path = Path(temp_dir) / "subdir" / "test_metadata.json"
+
+ processor = MetadataProcessor(doi="10.1000/test", output_path=output_path)
+ processor.console = MagicMock() # Mock console to avoid theme issues
+
+ metadata = {"title": "Test Dataset"}
+ # Create parent directory manually since _save_output doesn't do it
+ output_path.parent.mkdir(parents=True, exist_ok=True)
+ processor._save_output(metadata)
+
+ assert output_path.exists()
+ assert output_path.parent.exists()
+
+ def test_save_output_unicode_content(self):
+ """Test saving metadata with Unicode content."""
+ with tempfile.TemporaryDirectory() as temp_dir:
+ output_path = Path(temp_dir) / "unicode_metadata.json"
+
+ processor = MetadataProcessor(doi="10.1000/test", output_path=output_path)
+ processor.console = MagicMock() # Mock console to avoid theme issues
+
+ metadata = {
+ "title": "Étude sur les caractères spéciaux: αβγ, 中文, 日本語",
+ "author": "José María García-López",
+ }
+ processor._save_output(metadata)
+
+ # Verify Unicode content is preserved
+ with open(output_path, encoding="utf-8") as f:
+ saved_data = json.load(f)
+ assert "Étude" in saved_data["title"]
+ assert "García" in saved_data["author"]
+
+ @patch("doi2dataset.processing.metadata.MetadataProcessor._fetch_data")
+ def test_process_fetch_failure(self, mock_fetch):
+ """Test fetch failures propagate properly."""
+ mock_fetch.side_effect = ValueError("API Error")
+
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json")
+ )
+ processor.console = MagicMock() # Mock console to avoid theme issues
+
+ with pytest.raises(ValueError, match="API Error"):
+ processor.process()
+
+ @patch("doi2dataset.processing.metadata.MetadataProcessor._fetch_data")
+ @patch("doi2dataset.processing.metadata.MetadataProcessor._build_metadata")
+ def test_process_build_failure(self, mock_build, mock_fetch):
+ """Test metadata building failures propagate properly."""
+ mock_fetch.return_value = {"title": "Test Paper"}
+ mock_build.side_effect = KeyError("Missing required field")
+
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json")
+ )
+ processor.console = MagicMock() # Mock console to avoid theme issues
+
+ with pytest.raises(KeyError, match="Missing required field"):
+ processor.process()
+
+ def test_update_progress_with_progress_bar(self):
+ """Test progress update when progress bar is enabled."""
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json"), progress=True
+ )
+ processor.console = MagicMock()
+
+ # Mock progress bar
+ mock_progress = MagicMock()
+ processor.progress = mock_progress
+ processor.task_id = "test_task_id"
+
+ processor._update_progress()
+
+ # Verify progress.advance was called
+ mock_progress.advance.assert_called_once_with("test_task_id")
+
+ def test_update_progress_without_progress_bar(self):
+ """Test progress update when progress bar is disabled."""
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json"), progress=False
+ )
+ processor.console = MagicMock()
+
+ # No progress bar set
+ processor.progress = None
+ processor.task_id = None
+
+ # Should not raise any errors
+ processor._update_progress()
+
+ @patch("doi2dataset.processing.metadata.APIClient")
+ def test_upload_success_with_persistent_id(self, mock_api_client_class):
+ """Test successful upload with persistent ID response."""
+ import os
+
+ from doi2dataset import Config
+
+ # Load test config
+ config_path = os.path.join(os.path.dirname(__file__), "config_test.yaml")
+ Config.load_config(config_path=config_path)
+
+ # Mock the APIClient instance and response
+ mock_client = Mock()
+ mock_response = Mock()
+        mock_response.status_code = HTTPStatus.CREATED  # Success status for upload
+ mock_response.json.return_value = {
+ "data": {"persistentId": "doi:10.7910/DVN/TEST123"}
+ }
+ mock_client.make_request.return_value = mock_response
+ mock_api_client_class.return_value = mock_client
+
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json"), upload=True
+ )
+ processor.console = MagicMock()
+
+ metadata = {"datasetVersion": {"files": []}}
+ result = processor._upload_data(metadata)
+
+ # Verify successful response handling
+ assert result["data"]["persistentId"] == "doi:10.7910/DVN/TEST123"
+ processor.console.print.assert_called()
+
+ @patch("doi2dataset.processing.metadata.APIClient")
+ def test_upload_success_console_output(self, mock_api_client_class):
+ """Test console output during successful upload."""
+        import os
+
+        from doi2dataset import Config
+
+ # Load test config
+ config_path = os.path.join(os.path.dirname(__file__), "config_test.yaml")
+ Config.load_config(config_path=config_path)
+
+ # Mock the APIClient instance and response
+ mock_client = Mock()
+ mock_response = Mock()
+        mock_response.status_code = HTTPStatus.CREATED  # Success status for upload
+ mock_response.json.return_value = {
+ "data": {"persistentId": "doi:10.7910/DVN/TEST123"}
+ }
+ mock_client.make_request.return_value = mock_response
+ mock_api_client_class.return_value = mock_client
+
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json"), upload=True
+ )
+ processor.console = MagicMock()
+
+ metadata = {"datasetVersion": {"files": []}}
+ processor._upload_data(metadata)
+
+ # Verify successful upload message was printed
+ processor.console.print.assert_called()
+ call_args = [call[0][0] for call in processor.console.print.call_args_list]
+ upload_message = next(
+ (msg for msg in call_args if "Dataset uploaded to:" in msg), None
+ )
+ assert upload_message is not None
+ assert "TEST123" in upload_message
+
+ def test_progress_update_integration(self):
+ """Test progress updates during complete processing workflow."""
+
+ # Mock all external dependencies
+ mock_data = {"title": "Test Paper", "authorships": []}
+
+ with patch(
+ "doi2dataset.processing.metadata.MetadataProcessor._fetch_data",
+ return_value=mock_data,
+ ):
+ with patch(
+ "doi2dataset.processing.metadata.MetadataProcessor._build_metadata",
+ return_value={"test": "metadata"},
+ ):
+ with patch(
+ "doi2dataset.processing.metadata.MetadataProcessor._save_output"
+ ):
+ processor = MetadataProcessor(
+ doi="10.1000/test",
+ output_path=Path("/tmp/test.json"),
+ progress=True,
+ )
+ processor.console = MagicMock()
+
+ # Mock progress bar
+ mock_progress = MagicMock()
+ processor.progress = mock_progress
+ processor.task_id = "test_task"
+
+ # Process should call _update_progress multiple times
+ processor.process()
+
+ # Verify progress was advanced multiple times (fetch, build, save)
+ assert mock_progress.advance.call_count >= 3
+ for call in mock_progress.advance.call_args_list:
+ assert call[0][0] == "test_task"
+
+ def test_fetch_data_with_real_structure(self, openalex_data):
+ """Test _fetch_data method with realistic OpenAlex response structure."""
+
+ mock_client = Mock()
+ mock_response = Mock()
+ mock_response.status_code = HTTPStatus.OK
+ mock_response.json.return_value = openalex_data
+ # Test fetch_data with real structure
+ mock_client.make_request.return_value = mock_response
+
+ with patch(
+ "doi2dataset.processing.metadata.APIClient", return_value=mock_client
+ ):
+ processor = MetadataProcessor(
+ doi="10.1038/srep45389", output_path=Path("/tmp/test.json")
+ )
+ processor.console = MagicMock()
+
+ result = processor._fetch_data()
+
+ # Verify we got the expected data structure
+ assert result == openalex_data
+ assert "title" in result
+ assert "authorships" in result
+ assert "publication_date" in result
+
+ def test_partial_data(self):
+ """Test handling of incomplete API responses."""
+ with patch(
+ "doi2dataset.processing.metadata.MetadataProcessor._fetch_data"
+ ) as mock_fetch:
+ # Simulate partial data from API
+ mock_fetch.return_value = {
+ "title": "Test Paper",
+ # Missing authors, publication_date, etc.
+ }
+
+ with patch(
+ "doi2dataset.processing.metadata.MetadataProcessor._build_metadata"
+ ) as mock_build:
+ mock_build.return_value = {"datasetVersion": {"title": "Test Dataset"}}
+
+ with patch(
+ "doi2dataset.processing.metadata.MetadataProcessor._save_output"
+ ):
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json")
+ )
+ processor.console = (
+ MagicMock()
+ ) # Mock console to avoid theme issues
+
+ # Should handle partial data gracefully
+ processor.process()
+
+ mock_build.assert_called_once_with({"title": "Test Paper"})
+
+ def test_network_timeout(self):
+ """Test handling of network timeouts."""
+ with patch(
+ "doi2dataset.processing.metadata.MetadataProcessor._fetch_data"
+ ) as mock_fetch:
+ mock_fetch.side_effect = TimeoutError("Network timeout")
+
+ processor = MetadataProcessor(
+ doi="10.1000/test", output_path=Path("/tmp/test.json")
+ )
+ processor.console = MagicMock() # Mock console to avoid theme issues
+
+ with pytest.raises(TimeoutError, match="Network timeout"):
+ processor.process()
diff --git a/tests/test_models.py b/tests/test_models.py
new file mode 100644
index 0000000..391abc5
--- /dev/null
+++ b/tests/test_models.py
@@ -0,0 +1,164 @@
+from doi2dataset import Institution, Person
+
+
+def test_person_to_dict_with_string_affiliation():
+ """Test Person.to_dict() with a string affiliation."""
+ person = Person(
+ family_name="Doe",
+ given_name="John",
+ orcid="0000-0001-2345-6789",
+ email="john.doe@example.org",
+ affiliation="Test University",
+ )
+
+ result = person.to_dict()
+
+ assert result["family_name"] == "Doe"
+ assert result["given_name"] == "John"
+ assert result["orcid"] == "0000-0001-2345-6789"
+ assert result["email"] == "john.doe@example.org"
+ assert result["affiliation"] == "Test University"
+
+
+def test_person_to_dict_with_institution_ror():
+ """Test Person.to_dict() with an Institution that has a ROR ID."""
+ inst = Institution("Test University", "https://ror.org/12345")
+
+ person = Person(
+ family_name="Doe",
+ given_name="John",
+ orcid="0000-0001-2345-6789",
+ email="john.doe@example.org",
+ affiliation=inst,
+ )
+
+ result = person.to_dict()
+
+ assert result["affiliation"] == "https://ror.org/12345"
+ # Check other fields too
+ assert result["family_name"] == "Doe"
+ assert result["given_name"] == "John"
+
+
+def test_person_to_dict_with_institution_display_name_only():
+ """Test Person.to_dict() with an Institution that has only a display_name."""
+ inst = Institution("Test University") # No ROR ID
+
+ person = Person(
+ family_name="Smith",
+ given_name="Jane",
+ orcid="0000-0001-9876-5432",
+ affiliation=inst,
+ )
+
+ result = person.to_dict()
+
+ assert result["affiliation"] == "Test University"
+ assert result["family_name"] == "Smith"
+ assert result["given_name"] == "Jane"
+
+
+def test_person_to_dict_with_empty_institution():
+ """Test Person.to_dict() with an Institution that has neither ROR nor display_name."""
+ # Create an Institution with empty values
+ inst = Institution("")
+
+ person = Person(family_name="Brown", given_name="Robert", affiliation=inst)
+
+ result = person.to_dict()
+
+ assert result["affiliation"] == ""
+ assert result["family_name"] == "Brown"
+ assert result["given_name"] == "Robert"
+
+
+def test_person_to_dict_with_no_affiliation():
+ """Test Person.to_dict() with no affiliation."""
+ person = Person(
+ family_name="Green", given_name="Alice", orcid="0000-0002-1111-2222"
+ )
+
+ result = person.to_dict()
+
+ assert result["affiliation"] == ""
+ assert result["family_name"] == "Green"
+ assert result["given_name"] == "Alice"
+ assert result["orcid"] == "0000-0002-1111-2222"
+
+
+def test_person_creation_from_real_authorship_data(openalex_data):
+ """Test Person creation by processing real OpenAlex authorship data."""
+ from doi2dataset.utils.validation import split_name
+
+ # Process first authorship like the real application would
+ first_authorship = openalex_data["authorships"][0]
+ author_data = first_authorship["author"]
+
+ # Extract display_name and process it like CitationBuilder does
+ display_name = author_data.get("display_name", "")
+ given_name, family_name = split_name(display_name)
+
+ # Extract ORCID and clean it like the real application
+ orcid = author_data.get("orcid")
+ if orcid and "orcid.org/" in orcid:
+ orcid = orcid.split("orcid.org/")[-1]
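+        # e.g. "https://orcid.org/0000-0001-2345-6789" -> "0000-0001-2345-6789"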
+
+ person = Person(
+ family_name=family_name,
+ given_name=given_name,
+ orcid=orcid,
+ email=None,
+ affiliation=None,
+ )
+
+ # Verify the processing worked correctly
+ assert person.family_name != ""
+ assert person.given_name != ""
+ if orcid:
+ assert len(person.orcid) == 19 # ORCID format: 0000-0000-0000-0000
+
+
+def test_institution_processing_from_real_data(openalex_data):
+ """Test Institution creation by processing real OpenAlex institution data."""
+ # Process first institution like the real application would
+ first_authorship = openalex_data["authorships"][0]
+ institution_data = first_authorship["institutions"][0]
+
+ # Extract and process data like CitationBuilder does
+ display_name = institution_data.get("display_name", "")
+ ror = institution_data.get("ror", "")
+
+ institution = Institution(display_name=display_name, ror=ror)
+
+ # Test that processing preserves essential functionality
+ assert len(institution.display_name) > 0
+ if ror:
+ assert ror.startswith("https://ror.org/")
+ affiliation_field = institution.affiliation_field()
+ assert affiliation_field.value == ror
+ assert affiliation_field.expanded_value["termName"] == display_name
+
+
+def test_multiple_institutions_processing(openalex_data):
+ """Test processing multiple institutions from real authorship data."""
+ institutions_created = []
+
+ # Process all institutions like the real application would
+ for authorship in openalex_data["authorships"]:
+ for institution_data in authorship.get("institutions", []):
+ display_name = institution_data.get("display_name", "")
+ ror = institution_data.get("ror", "")
+
+ if display_name: # Only create if there's actual data
+ institution = Institution(display_name=display_name, ror=ror)
+ institutions_created.append(institution)
+
+ # Verify we processed multiple institutions successfully
+ assert len(institutions_created) > 0
+
+ # All should have valid display names
+ assert all(len(inst.display_name) > 0 for inst in institutions_created)
+
+ # Some should have ROR IDs (based on real data)
+ ror_institutions = [inst for inst in institutions_created if inst.ror]
+ assert len(ror_institutions) > 0
diff --git a/tests/test_person.py b/tests/test_person.py
deleted file mode 100644
index 2e1e030..0000000
--- a/tests/test_person.py
+++ /dev/null
@@ -1,92 +0,0 @@
-from doi2dataset import Institution, Person
-
-
-def test_person_to_dict_with_string_affiliation():
- """Test Person.to_dict() with a string affiliation."""
- person = Person(
- family_name="Doe",
- given_name="John",
- orcid="0000-0001-2345-6789",
- email="john.doe@example.org",
- affiliation="Test University"
- )
-
- result = person.to_dict()
-
- assert result["family_name"] == "Doe"
- assert result["given_name"] == "John"
- assert result["orcid"] == "0000-0001-2345-6789"
- assert result["email"] == "john.doe@example.org"
- assert result["affiliation"] == "Test University"
-
-
-def test_person_to_dict_with_institution_ror():
- """Test Person.to_dict() with an Institution that has a ROR ID."""
- inst = Institution("Test University", "https://ror.org/12345")
-
- person = Person(
- family_name="Doe",
- given_name="John",
- orcid="0000-0001-2345-6789",
- email="john.doe@example.org",
- affiliation=inst
- )
-
- result = person.to_dict()
-
- assert result["affiliation"] == "https://ror.org/12345"
- # Check other fields too
- assert result["family_name"] == "Doe"
- assert result["given_name"] == "John"
-
-
-def test_person_to_dict_with_institution_display_name_only():
- """Test Person.to_dict() with an Institution that has only a display_name."""
- inst = Institution("Test University") # No ROR ID
-
- person = Person(
- family_name="Smith",
- given_name="Jane",
- orcid="0000-0001-9876-5432",
- affiliation=inst
- )
-
- result = person.to_dict()
-
- assert result["affiliation"] == "Test University"
- assert result["family_name"] == "Smith"
- assert result["given_name"] == "Jane"
-
-
-def test_person_to_dict_with_empty_institution():
- """Test Person.to_dict() with an Institution that has neither ROR nor display_name."""
- # Create an Institution with empty values
- inst = Institution("")
-
- person = Person(
- family_name="Brown",
- given_name="Robert",
- affiliation=inst
- )
-
- result = person.to_dict()
-
- assert result["affiliation"] == ""
- assert result["family_name"] == "Brown"
- assert result["given_name"] == "Robert"
-
-
-def test_person_to_dict_with_no_affiliation():
- """Test Person.to_dict() with no affiliation."""
- person = Person(
- family_name="Green",
- given_name="Alice",
- orcid="0000-0002-1111-2222"
- )
-
- result = person.to_dict()
-
- assert result["affiliation"] == ""
- assert result["family_name"] == "Green"
- assert result["given_name"] == "Alice"
- assert result["orcid"] == "0000-0002-1111-2222"
diff --git a/tests/test_publication_utils.py b/tests/test_publication_utils.py
index 9f042f5..d639f80 100644
--- a/tests/test_publication_utils.py
+++ b/tests/test_publication_utils.py
@@ -1,10 +1,10 @@
-import json
-import os
-import pytest
from unittest.mock import MagicMock
+import pytest
+
from doi2dataset import MetadataProcessor
+
@pytest.fixture
def metadata_processor():
"""Create a MetadataProcessor instance with mocked dependencies"""
@@ -14,44 +14,124 @@ def metadata_processor():
processor.console = MagicMock()
return processor
+
def test_get_publication_year_with_publication_year(metadata_processor):
"""Test that _get_publication_year extracts year from publication_year field"""
data = {"publication_year": 2020}
year = metadata_processor._get_publication_year(data)
assert year == 2020
+
def test_get_publication_year_with_date(metadata_processor):
"""Test that _get_publication_year returns empty string when publication_year is missing"""
data = {"publication_date": "2019-05-15"}
year = metadata_processor._get_publication_year(data)
assert year == ""
+
+def test_publication_year_processing_logic(openalex_data):
+ """Test publication year extraction logic with real OpenAlex data structure."""
+ doi = openalex_data["doi"].replace("https://doi.org/", "")
+ processor = MetadataProcessor(doi=doi, upload=False, progress=False)
+ processor.console = MagicMock()
+
+ # Test the actual processing logic used by the application
+ year = processor._get_publication_year(openalex_data)
+
+ # Verify the processing logic works (should prefer publication_year field)
+ assert isinstance(year, int)
+    assert year > 1900  # Reasonable publication year
+    assert year <= 2030  # Generous upper bound to catch garbage values
+
+
+def test_doi_validation_processing_pipeline(openalex_data):
+ """Test DOI processing pipeline with real OpenAlex DOI format."""
+ from doi2dataset.utils.validation import normalize_doi, validate_doi
+
+ # Extract DOI as the real application would
+ doi_from_data = openalex_data.get("doi", "")
+
+ # Process DOI through the same pipeline as real application
+ if doi_from_data.startswith("https://doi.org/"):
+ clean_doi = doi_from_data.replace("https://doi.org/", "")
+ else:
+ clean_doi = doi_from_data
+
+ # Test validation and normalization logic
+ is_valid = validate_doi(clean_doi)
+ normalized = normalize_doi(clean_doi)
+
+ assert is_valid is True
+ assert normalized.startswith("10.")
+ assert len(normalized.split("/")) == 2 # Should have registrant/suffix format
+
+
+def test_subject_mapping_processing_logic(openalex_data):
+ """Test subject mapping logic with real OpenAlex topics structure."""
+ from doi2dataset import SubjectMapper
+
+ # Process topics exactly as the real application would
+ topics = openalex_data.get("topics", [])
+
+ # Test SubjectMapper processing logic
+ subjects = SubjectMapper.get_subjects({"topics": topics})
+
+ # Verify the mapping logic produces valid results
+ assert isinstance(subjects, list)
+
+ # If we have topics, we should get subjects
+ if topics:
+ assert len(subjects) > 0
+ # Each subject should be a string
+ assert all(isinstance(subj, str) for subj in subjects)
+
+
+def test_abstract_reconstruction_processing(openalex_data):
+ """Test abstract reconstruction logic with real inverted index data."""
+ from doi2dataset.api.client import APIClient
+ from doi2dataset.api.processors import AbstractProcessor
+
+ # Test the actual reconstruction logic used in the application
+ processor = AbstractProcessor(APIClient())
+
+ # Process abstract inverted index as the real application would
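+    # (OpenAlex stores abstracts as an inverted index mapping each word to its positions)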
+ reconstructed = processor._get_openalex_abstract(openalex_data)
+
+ if openalex_data.get("abstract_inverted_index"):
+ # Should successfully reconstruct abstract
+ assert reconstructed is not None
+ assert isinstance(reconstructed, str)
+ assert len(reconstructed) > 0
+ # Should contain readable text with spaces
+ assert " " in reconstructed
+ else:
+ # Should handle missing abstract gracefully
+ assert reconstructed is None
+
+
def test_get_publication_year_with_both_fields(metadata_processor):
"""Test that _get_publication_year prioritizes publication_year over date"""
- data = {
- "publication_year": 2020,
- "publication_date": "2019-05-15"
- }
+ data = {"publication_year": 2020, "publication_date": "2019-05-15"}
year = metadata_processor._get_publication_year(data)
assert year == 2020
+
def test_get_publication_year_with_partial_date(metadata_processor):
"""Test that _get_publication_year returns empty string when only publication_date is present"""
data = {"publication_date": "2018"}
year = metadata_processor._get_publication_year(data)
assert year == ""
+
def test_get_publication_year_with_missing_data(metadata_processor):
"""Test that _get_publication_year handles missing data"""
data = {"other_field": "value"}
year = metadata_processor._get_publication_year(data)
assert year == ""
+
def test_get_publication_year_with_invalid_data(metadata_processor):
"""Test that _get_publication_year returns whatever is in publication_year field"""
- data = {
- "publication_year": "not-a-year",
- "publication_date": "invalid-date"
- }
+ data = {"publication_year": "not-a-year", "publication_date": "invalid-date"}
year = metadata_processor._get_publication_year(data)
- assert year == "not-a-year"
\ No newline at end of file
+ assert year == "not-a-year"
diff --git a/tests/test_validation_utils.py b/tests/test_validation_utils.py
new file mode 100644
index 0000000..ffcae4a
--- /dev/null
+++ b/tests/test_validation_utils.py
@@ -0,0 +1,600 @@
+import os
+import sys
+import tempfile
+from unittest.mock import Mock, patch
+
+import dns.resolver
+import yaml
+from email_validator import EmailNotValidError
+
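+# Make the repository root importable so doi2dataset resolves without installation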
+sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
+
+from doi2dataset import Config, NameProcessor, sanitize_filename, validate_email_address
+from doi2dataset.utils.validation import (
+ normalize_doi,
+ normalize_string,
+ validate_doi,
+)
+
+
+def test_sanitize_filename():
+ """Test the sanitize_filename function to convert DOI to a valid filename."""
+ doi = "10.1234/abc.def"
+ expected = "10_1234_abc_def"
+ result = sanitize_filename(doi)
+ assert result == expected
+
+
+def test_split_name_with_comma():
+ """Test splitting a full name that contains a comma."""
+ full_name = "Doe, John"
+ given, family = NameProcessor.split_name(full_name)
+ assert given == "John"
+ assert family == "Doe"
+
+
+def test_split_name_without_comma():
+ """Test splitting a full name that does not contain a comma."""
+ full_name = "John Doe"
+ given, family = NameProcessor.split_name(full_name)
+ assert given == "John"
+ assert family == "Doe"
+
+
+def test_validate_email_address_valid():
+ """Test that a valid email address is correctly recognized."""
+ valid_email = "john.doe@iana.org"
+ assert validate_email_address(valid_email) is True
+
+
+def test_validate_email_address_invalid():
+ """Test that an invalid email address is correctly rejected."""
+ invalid_email = "john.doe@invalid_domain"
+ assert validate_email_address(invalid_email) is False
+
+
+def test_config_environment_variable_override():
+ """Test that environment variables override config file values."""
+ # Create a temporary config file with base values
+ config_data = {
+ "dataverse": {
+ "url": "https://config-file-url.org",
+ "api_token": "config-file-token",
+ "dataverse": "config-file-dataverse",
+ "auth_user": "config-file-user",
+ "auth_password": "config-file-password",
+ },
+ "pis": [],
+ "default_grants": [],
+ }
+
+ with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
+ yaml.dump(config_data, f)
+ temp_config_path = f.name
+
+ try:
+ # Set environment variables
+ os.environ["DATAVERSE_URL"] = "https://env-url.org"
+ os.environ["DATAVERSE_API_TOKEN"] = "env-token"
+ os.environ["DATAVERSE_DATAVERSE"] = "env-dataverse"
+ os.environ["DATAVERSE_AUTH_USER"] = "env-user"
+ os.environ["DATAVERSE_AUTH_PASSWORD"] = "env-password"
+
+ # Reset the Config singleton to ensure fresh load
+ Config._instance = None
+ Config._config_data = None
+
+ # Load config with environment variables
+ Config.load_config(temp_config_path)
+ config = Config()
+
+ # Verify environment variables override config file values
+ assert config.DATAVERSE["url"] == "https://env-url.org"
+ assert config.DATAVERSE["api_token"] == "env-token"
+ assert config.DATAVERSE["dataverse"] == "env-dataverse"
+ assert config.DATAVERSE["auth_user"] == "env-user"
+ assert config.DATAVERSE["auth_password"] == "env-password"
+
+ finally:
+ # Clean up environment variables
+ for env_var in [
+ "DATAVERSE_URL",
+ "DATAVERSE_API_TOKEN",
+ "DATAVERSE_DATAVERSE",
+ "DATAVERSE_AUTH_USER",
+ "DATAVERSE_AUTH_PASSWORD",
+ ]:
+ if env_var in os.environ:
+ del os.environ[env_var]
+
+ # Clean up temp file
+ os.unlink(temp_config_path)
+
+ # Reset Config singleton
+ Config._instance = None
+ Config._config_data = None
+
+
+# Email validation edge cases
+def test_validate_email_subdomain():
+    """Test validation of an email at a known-good domain."""
+    # This test performs actual DNS resolution, so use a domain that is
+    # known to publish MX records
+ assert validate_email_address("test@iana.org") is True
+
+
+def test_validate_email_malformed():
+ """Test validation of malformed email addresses."""
+ invalid_emails = [
+ "notanemail",
+ "@example.com",
+ "user@",
+ "user..double.dot@example.com",
+ "user@.example.com",
+ "user@example.",
+ "user@ex ample.com",
+ "user name@example.com",
+ ]
+
+ for email in invalid_emails:
+ assert validate_email_address(email) is False
+
+
+@patch("dns.resolver.resolve")
+def test_validate_email_mx_record_exists(mock_resolve):
+ """Test that email validation checks for MX records."""
+    # dns.resolver.resolve is mocked, so the MX lookup returns a truthy answer
+ result = validate_email_address("test@iana.org")
+ assert result is True
+
+
+@patch("dns.resolver.resolve")
+def test_validate_email_no_mx_record(mock_resolve):
+ """Test email validation when domain has no MX record."""
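+    # Simulate a domain that resolves but publishes no MX records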
+ mock_resolve.side_effect = dns.resolver.NoAnswer()
+
+ with patch("email_validator.validate_email") as mock_validate:
+ mock_result = Mock()
+ mock_result.normalized = "test@nonexistent.com"
+ mock_validate.return_value = mock_result
+
+ result = validate_email_address("test@nonexistent.com")
+
+ assert result is False
+
+
+@patch("dns.resolver.resolve")
+def test_validate_email_domain_not_found(mock_resolve):
+ """Test email validation when domain doesn't exist."""
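+    # Simulate a domain that does not exist in DNS at all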
+ mock_resolve.side_effect = dns.resolver.NXDOMAIN()
+
+ with patch("email_validator.validate_email") as mock_validate:
+ mock_result = Mock()
+ mock_result.normalized = "test@fakeDomain123456.com"
+ mock_validate.return_value = mock_result
+
+ result = validate_email_address("test@fakeDomain123456.com")
+
+ assert result is False
+
+
+def test_validate_email_validator_error():
+ """Test email validation when email_validator raises error."""
+ with patch("email_validator.validate_email") as mock_validate:
+ mock_validate.side_effect = EmailNotValidError("Invalid email")
+
+ result = validate_email_address("invalid@email")
+
+ assert result is False
+
+
+@patch("dns.resolver.resolve")
+def test_validate_email_dns_exceptions(mock_resolve):
+ """Test email validation with DNS-related exceptions."""
+ # Test with mocked DNS resolver raising various exceptions
+ with patch("email_validator.validate_email") as mock_validate:
+ mock_result = Mock()
+ mock_result.normalized = "test@example.com"
+ mock_validate.return_value = mock_result
+
+ # Test with NoAnswer exception
+ mock_resolve.side_effect = dns.resolver.NoAnswer()
+ result = validate_email_address("test@example.com")
+ assert result is False
+
+ # Test with NXDOMAIN exception
+ mock_resolve.side_effect = dns.resolver.NXDOMAIN()
+ result = validate_email_address("test@example.com")
+ assert result is False
+
+
+def test_validate_email_validator_exceptions():
+ """Test email validation with email_validator exceptions."""
+ # Test email validator error
+ with patch("email_validator.validate_email") as mock_validate:
+ mock_validate.side_effect = EmailNotValidError("Invalid format")
+ result = validate_email_address("invalid-email")
+ assert result is False
+
+ # Test with various malformed emails that should fail validation
+ invalid_emails = [
+ "plainaddress",
+ "@missingusername.com",
+ "username@.com",
+ "username@com",
+ "username..double.dot@example.com",
+ ]
+
+ for email in invalid_emails:
+ assert validate_email_address(email) is False
+
+
+# DOI validation edge cases
+def test_validate_doi_formats():
+ """Test validation of various valid DOI formats."""
+ valid_dois = [
+ "10.1000/test",
+ "10.1234/example.article",
+ "10.5555/12345678901234567890",
+ "doi:10.1000/test",
+ "DOI:10.1000/test",
+ "https://doi.org/10.1000/test",
+ "http://dx.doi.org/10.1000/test",
+ ]
+
+ for doi in valid_dois:
+ assert validate_doi(doi) is True, f"Failed for DOI: {doi}"
+
+
+def test_validate_doi_malformed():
+ """Test validation of invalid DOI formats."""
+ invalid_dois = [
+ "",
+ "not-a-doi",
+ "10.1000", # Missing suffix
+ "1000/test", # Missing 10. prefix
+ "10./test", # Invalid registrant
+ "10.1000/", # Missing suffix
+ "10.1000 /test", # Space in DOI
+ ]
+
+ for doi in invalid_dois:
+ assert validate_doi(doi) is False, f"Should fail for: {doi}"
+
+
+def test_normalize_doi_formats():
+ """Test DOI normalization to standard format."""
+ test_cases = [
+ ("10.1000/test", "10.1000/test"),
+ ("doi:10.1000/test", "10.1000/test"),
+ ("DOI:10.1000/test", "10.1000/test"),
+ ("https://doi.org/10.1000/test", "10.1000/test"),
+ ("http://dx.doi.org/10.1000/test", "10.1000/test"),
+ ]
+
+ for input_doi, expected in test_cases:
+ result = normalize_doi(input_doi)
+ assert (
+ result == expected
+ ), f"Failed for {input_doi}: got {result}, expected {expected}"
+
+
+def test_normalize_doi_preserves_case():
+ """Test DOI normalization preserves case in suffix."""
+ doi = "10.1000/TestCaseSensitive"
+ normalized = normalize_doi(doi)
+ assert "TestCaseSensitive" in normalized
+
+
+# Filename sanitization edge cases
+def test_sanitize_filename_special_chars():
+ """Test sanitization of DOI with special characters."""
+ result = sanitize_filename("10.1234/example.article-2023_v1")
+ assert result == "10_1234_example_article_2023_v1"
+
+
+def test_sanitize_filename_consecutive_underscores():
+ """Test consecutive underscores are removed."""
+ result = sanitize_filename("10.1000//test..article")
+ assert "__" not in result
+ assert result == "10_1000_test_article"
+
+
+def test_sanitize_filename_trim_underscores():
+ """Test removal of leading and trailing underscores."""
+ result = sanitize_filename(".10.1000/test.")
+ assert not result.startswith("_")
+ assert not result.endswith("_")
+
+
+def test_sanitize_filename_unicode():
+ """Test sanitization of DOI with Unicode characters."""
+ result = sanitize_filename("10.1000/tëst-ärticle")
+ assert result == "10_1000_tëst_ärticle"
+
+
+def test_sanitize_filename_empty():
+ """Test sanitization of empty string."""
+ result = sanitize_filename("")
+ assert result == ""
+
+
+def test_sanitize_filename_special_only():
+ """Test sanitization of string with only special characters."""
+ result = sanitize_filename("!@#$%^&*()")
+ assert result == ""
+
+
+def test_sanitize_filename_alphanumeric():
+ """Test sanitization preserves alphanumeric characters."""
+ result = sanitize_filename("abc123XYZ")
+ assert result == "abc123XYZ"
+
+
+# Name splitting edge cases
+def test_split_name_multiple_given():
+ """Test splitting names with multiple first names."""
+ given, family = NameProcessor.split_name("John Michael Doe")
+ assert given == "John Michael"
+ assert family == "Doe"
+
+
+def test_split_name_comma_multiple_given():
+ """Test splitting comma format with multiple first names."""
+ given, family = NameProcessor.split_name("Doe, John Michael")
+ assert given == "John Michael"
+ assert family == "Doe"
+
+
+def test_split_name_single():
+ """Test splitting when only one name is provided."""
+ given, family = NameProcessor.split_name("Madonna")
+ assert given == ""
+ assert family == "Madonna"
+
+
+def test_split_name_empty():
+ """Test splitting empty string."""
+ try:
+ given, family = NameProcessor.split_name("")
+ assert given == ""
+ assert family == ""
+ except IndexError:
+ # NameProcessor may raise IndexError for empty strings
+ pass
+
+
+def test_split_name_whitespace():
+ """Test splitting string with only whitespace."""
+ try:
+ given, family = NameProcessor.split_name(" ")
+ assert given == ""
+ assert family == ""
+ except IndexError:
+ # NameProcessor may raise IndexError for whitespace-only strings
+ pass
+
+
+def test_split_name_extra_whitespace():
+ """Test splitting name with extra whitespace."""
+ given, family = NameProcessor.split_name(" John Doe ")
+ assert given == "John"
+ assert family == "Doe"
+
+
+def test_split_name_comma_whitespace():
+ """Test splitting comma format with extra whitespace."""
+ given, family = NameProcessor.split_name(" Doe , John ")
+ assert given == "John"
+ assert family == "Doe"
+
+
+def test_split_name_hyphenated():
+ """Test splitting names with hyphenated last names."""
+ given, family = NameProcessor.split_name("John Smith-Jones")
+ assert given == "John"
+ assert family == "Smith-Jones"
+
+
+def test_split_name_apostrophe():
+ """Test splitting names with apostrophes."""
+ given, family = NameProcessor.split_name("John O'Connor")
+ assert given == "John"
+ assert family == "O'Connor"
+
+
+def test_split_name_unicode():
+ """Test splitting names with Unicode characters."""
+ given, family = NameProcessor.split_name("José García")
+ assert given == "José"
+ assert family == "García"
+
+
+def test_split_name_multiple_commas():
+ """Test splitting name with multiple commas (should split on first)."""
+ given, family = NameProcessor.split_name("Doe, Jr., John")
+ assert given == "Jr., John"
+ assert family == "Doe"
+
+
+# String normalization edge cases
+def test_normalize_string_ascii():
+ """Test normalization of basic ASCII string."""
+ result = normalize_string("Hello World")
+ assert result == "Hello World"
+
+
+def test_normalize_string_accents():
+ """Test normalization of Unicode accented characters."""
+ result = normalize_string("Café résumé naïve")
+ assert result == "Cafe resume naive"
+
+
+def test_normalize_string_german_umlauts():
+ """Test normalization of German umlauts."""
+ result = normalize_string("Müller Größe")
+ assert result == "Muller Groe"
+
+
+def test_normalize_string_scandinavian_chars():
+ """Test normalization of Scandinavian characters."""
+ result = normalize_string("Åse Ørsted")
+ # Some implementations may preserve more characters
+ assert "Ase" in result and "rsted" in result
+
+
+def test_normalize_string_mixed_scripts():
+ """Test normalization with mixed scripts removes non-ASCII."""
+ result = normalize_string("Hello 世界 Мир")
+ assert result == "Hello"
+
+
+def test_normalize_string_empty():
+ """Test normalization of empty string."""
+ result = normalize_string("")
+ assert result == ""
+
+
+def test_normalize_string_whitespace():
+ """Test normalization of whitespace-only string."""
+ result = normalize_string(" \n\t ")
+ assert result == ""
+
+
+def test_normalize_string_trim_whitespace():
+ """Test leading/trailing whitespace is stripped."""
+ result = normalize_string(" Hello World ")
+ assert result == "Hello World"
+
+
+def test_normalize_string_numbers_punctuation():
+ """Test normalization preserves numbers and punctuation."""
+ result = normalize_string("Test 123! (2023)")
+ assert result == "Test 123! (2023)"
+
+
+def test_normalize_string_ligatures():
+ """Test normalization of Unicode ligatures."""
+    result = normalize_string("ﬁle ﬂag")  # ﬁ and ﬂ ligature characters
+ assert result == "file flag"
+
+
+def test_normalize_string_combining_marks():
+ """Test normalization of combining diacritical marks."""
+ # e with combining acute accent vs precomposed é
+ combining = "e\u0301" # e + combining acute
+ precomposed = "é"
+
+ result1 = normalize_string(combining)
+ result2 = normalize_string(precomposed)
+
+ assert result1 == result2 == "e"
+
+
+# Integration tests
+def test_doi_to_filename():
+ """Test pipeline from DOI validation to filename generation."""
+ doi = "doi:10.1234/example.article-2023"
+
+ # Validate DOI
+ assert validate_doi(doi) is True
+
+ # Normalize DOI
+ normalized = normalize_doi(doi)
+ assert normalized == "10.1234/example.article-2023"
+
+ # Sanitize for filename
+ filename = sanitize_filename(normalized)
+ assert filename == "10_1234_example_article_2023"
+
+
+def test_author_name_processing():
+ """Test pipeline for processing author names."""
+ author_name = "García-López, José María"
+
+ # Split name
+ given, family = NameProcessor.split_name(author_name)
+ assert given == "José María"
+ assert family == "García-López"
+
+ # Normalize for comparison - actual behavior may vary
+ normalized_given = normalize_string(given)
+ normalized_family = normalize_string(family)
+ # Test that normalization occurred, exact result may vary
+ assert len(normalized_given) > 0
+ assert len(normalized_family) > 0
+
+
+def test_validation_error_handling():
+ """Test validation functions handle errors gracefully."""
+ # Test with empty inputs
+ assert validate_doi("") is False
+ assert sanitize_filename("") == ""
+
+ # Test with edge case inputs
+ weird_input = " \n\t "
+ assert normalize_string(weird_input) == ""
+
+ try:
+ given, family = NameProcessor.split_name(weird_input)
+ assert given == ""
+ assert family == ""
+ except IndexError:
+ # NameProcessor may raise IndexError for edge case inputs
+ pass
+
+
+def test_config_partial_environment_variable_override():
+ """Test that only some environment variables can be set, others fall back to config file."""
+ # Create a temporary config file with base values
+ config_data = {
+ "dataverse": {
+ "url": "https://config-file-url.org",
+ "api_token": "config-file-token",
+ "dataverse": "config-file-dataverse",
+ "auth_user": "config-file-user",
+ "auth_password": "config-file-password",
+ },
+ "pis": [],
+ "default_grants": [],
+ }
+
+ with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
+ yaml.dump(config_data, f)
+ temp_config_path = f.name
+
+ try:
+ # Set only some environment variables
+ os.environ["DATAVERSE_URL"] = "https://env-url.org"
+ os.environ["DATAVERSE_API_TOKEN"] = "env-token"
+ # Don't set DATAVERSE_DATAVERSE, DATAVERSE_AUTH_USER, DATAVERSE_AUTH_PASSWORD
+
+ # Reset the Config singleton to ensure fresh load
+ Config._instance = None
+ Config._config_data = None
+
+ # Load config with partial environment variables
+ Config.load_config(temp_config_path)
+ config = Config()
+
+ # Verify environment variables override where set
+ assert config.DATAVERSE["url"] == "https://env-url.org"
+ assert config.DATAVERSE["api_token"] == "env-token"
+
+ # Verify config file values are used where env vars are not set
+ assert config.DATAVERSE["dataverse"] == "config-file-dataverse"
+ assert config.DATAVERSE["auth_user"] == "config-file-user"
+ assert config.DATAVERSE["auth_password"] == "config-file-password"
+
+ finally:
+ # Clean up environment variables
+ for env_var in ["DATAVERSE_URL", "DATAVERSE_API_TOKEN"]:
+ if env_var in os.environ:
+ del os.environ[env_var]
+
+ # Clean up temp file
+ os.unlink(temp_config_path)
+
+ # Reset Config singleton
+ Config._instance = None
+ Config._config_data = None