Quaternary oxide composition generation#
This tutorial demonstrates how to generate a set of quaternary oxide compositions using the SMACT library and a modified version of the smact_filter function from the smact.screening module, and then how to prepare the results for machine learning analysis.
Prerequisites#
Before starting, ensure you have the following libraries installed:
# Install the required packages (only needed when running on Google Colab)
try:
    import google.colab

    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    !uv pip install smact[featurisers] --quiet
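Outside Colab the install cell above does nothing, so it can help to confirm that the packages are importable before starting. The sketch below uses only the standard library. `missing_packages` is a hypothetical helper name, and the package list is taken from the imports used later in this tutorial.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    # find_spec returns None when a top-level package is not installed
    return [name for name in names if find_spec(name) is None]


required = ["smact", "matminer", "pymatgen", "pandas"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```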
Workflow#
1. Import required libraries#
"""
This module imports necessary libraries and modules for generating and analyzing
quaternary oxide compositions using SMACT and machine learning techniques.
"""
# Standard library imports
import multiprocessing
from itertools import combinations, product
# Third-party imports
import pandas as pd
from matminer.featurizers import composition as cf
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers.conversions import StrToComposition
from pymatgen.core import Composition
# Local imports
import smact
from smact import screening
"""
Imported modules:
- multiprocessing: For parallel processing capabilities
- itertools: For generating combinations and products
- pandas: For data manipulation and analysis
- matminer: For materials data mining and feature extraction
- pymatgen: For materials analysis
- smact: For chemical filters and composition screening of new materials
"""
2. Define elements and combinations#
Select the elements to use in our compositions:
# Define the elements we are interested in
all_el = smact.element_dictionary()
symbol_list = list(all_el.keys())
# List of elements to exclude
do_not_want = [
    "H", "He", "B", "C", "O", "Ne", "Ar", "Kr", "Tc", "Xe", "Rn",
    "Ac", "Th", "Pa", "U", "Np", "Pu", "Am", "Cm", "Bk", "Cf", "Es",
    "Fm", "Md", "No", "Lr", "Ra", "Fr", "At", "Po", "Pm", "Eu", "Tb", "Yb",
]
# Create a list of elements we want to use
good_elements = [all_el[x] for x in symbol_list if x not in do_not_want]
# Generate all possible combinations of 3 elements from good_elements
# (oxygen is added as the fourth element inside smact_filter)
all_el_combos = combinations(good_elements, 3)
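Because `combinations` returns a lazy iterator, the triples are only generated when something consumes them, and the iterator can be consumed only once. The number of triples is the binomial coefficient C(n, 3) for n elements. A small sketch with a hypothetical five-element list (plain symbols rather than SMACT Element objects):

```python
from itertools import combinations
from math import comb

symbols = ["Li", "Na", "Mg", "Al", "Ti"]  # hypothetical subset, for illustration only

triples = combinations(symbols, 3)  # lazy iterator, nothing generated yet
triple_list = list(triples)         # consuming it yields C(5, 3) tuples
leftover = list(triples)            # the iterator is now exhausted

print(len(triple_list), comb(len(symbols), 3))
print(triple_list[0])
print(leftover)
```

In this tutorial, `all_el_combos` is consumed when it is passed to `p.map` in step 4, so it has to be regenerated before it can be reused.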
3. Define SMACT filtering function#
Create a function to filter element combinations based on SMACT criteria:
def smact_filter(els):
    """
    Filter element combinations based on SMACT criteria.

    This function takes a combination of elements and applies SMACT
    (Semiconducting Materials from Analogy and Chemical Theory) tests
    to generate potential quaternary oxide compositions.

    Args:
        els (tuple): A tuple containing three Element objects.

    Returns:
        list: A list of tuples, each containing a set of elements and their ratios
        that pass the SMACT criteria.
    """
    all_compounds = []
    elements = [e.symbol for e in els] + ["O"]

    # Get Pauling electronegativities
    paul_a, paul_b, paul_c = (el.pauling_eneg for el in els)
    electronegativities = [paul_a, paul_b, paul_c, 3.44]  # 3.44 is for Oxygen

    # Iterate through all possible oxidation state combinations
    for ox_states in product(*(el.oxidation_states for el in els)):
        ox_states = list(ox_states) + [-2]  # Add oxygen's oxidation state

        # Test for charge balance
        cn_e, cn_r = smact.neutral_ratios(ox_states, threshold=8)

        if cn_e:
            # Electronegativity test
            if screening.pauling_test(ox_states, electronegativities):
                compound = (elements, cn_r[0])
                all_compounds.append(compound)

    return all_compounds
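To make the two tests inside `smact_filter` more concrete, here is a minimal pure-Python sketch of the ideas behind them. It is not SMACT's implementation: `neutral_ratios_sketch` and `pauling_sketch` are hypothetical stand-ins. They assume that charge balance means the oxidation states, weighted by integer stoichiometries, sum to zero, and that the electronegativity test requires every cation to be less electronegative than every anion. The example uses a ternary oxide to keep the output short, with Pauling electronegativities of 0.89 for Ba and 1.54 for Ti.

```python
from functools import reduce
from itertools import product
from math import gcd


def neutral_ratios_sketch(ox_states, threshold=8):
    """Brute-force search for integer ratios that give zero net charge."""
    ratios = []
    for ratio in product(range(1, threshold + 1), repeat=len(ox_states)):
        net_charge = sum(ox * n for ox, n in zip(ox_states, ratio))
        # Keep only irreducible ratios, so (1, 1, 3) is kept but (2, 2, 6) is not
        if net_charge == 0 and reduce(gcd, ratio) == 1:
            ratios.append(ratio)
    return ratios


def pauling_sketch(ox_states, electronegativities):
    """Check that every cation is less electronegative than every anion."""
    cations = [en for ox, en in zip(ox_states, electronegativities) if ox > 0]
    anions = [en for ox, en in zip(ox_states, electronegativities) if ox < 0]
    return all(c < a for c in cations for a in anions)


# Ba(+2), Ti(+4) and O(-2), with stoichiometries limited to 3
print(neutral_ratios_sketch((2, 4, -2), threshold=3))
print(pauling_sketch((2, 4, -2), (0.89, 1.54, 3.44)))
```

A brute-force search like this scales as `threshold ** n`: for four species and a threshold of 8, that is 8^4 = 4096 candidate ratios per oxidation-state combination, which is one reason the full enumeration is slow.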
4. Process element combinations#
Apply the SMACT filter to all element combinations. Multiprocessing is used here to speed things up (generating all compositions takes ~40 minutes on a 4 GHz Intel Core i7 iMac).
def process_element_combinations(all_el_combos):
    """
    Process all element combinations using multiprocessing.

    This function applies the smact_filter to all element combinations
    using a multiprocessing pool to improve performance.

    Args:
        all_el_combos (iterable): An iterable of element combinations.

    Returns:
        list: A flattened list of all compounds that pass the SMACT criteria.
    """
    with multiprocessing.Pool() as p:
        # Apply smact_filter to all element combinations in parallel
        result = p.map(smact_filter, all_el_combos)

    # Flatten the list of results
    flat_list = [item for sublist in result for item in sublist]
    return flat_list


# Process all element combinations
flat_list = process_element_combinations(all_el_combos)

# Print the number of compositions found
print(f"Number of compositions: {len(flat_list)}")
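The map-then-flatten pattern used in this step can be sketched independently of SMACT. In the sketch below, `toy_filter` and `parallel_flat_map` are hypothetical names: `toy_filter` stands in for `smact_filter` and returns a list per input, and the `processes=1` option runs serially, which is an added convenience for debugging rather than part of the tutorial.

```python
import multiprocessing
from itertools import chain


def toy_filter(triple):
    """Stand-in for smact_filter: return zero or more results per input."""
    # Keep the triple only if its members sum to an even number
    return [triple] if sum(triple) % 2 == 0 else []


def parallel_flat_map(func, items, processes=None, chunksize=1):
    """Apply func to every item and flatten the resulting lists."""
    if processes == 1:
        # Serial fallback: easier to debug and profile
        nested = [func(item) for item in items]
    else:
        with multiprocessing.Pool(processes) as pool:
            # map preserves input order and blocks until all work is done
            nested = pool.map(func, items, chunksize=chunksize)
    return list(chain.from_iterable(nested))


serial = parallel_flat_map(toy_filter, [(1, 2, 3), (1, 2, 4), (2, 2, 4)], processes=1)
print(serial)
```

When running as a script rather than in a notebook, the parallel path should be called from inside an `if __name__ == "__main__":` block so that worker processes can import the module safely. For large inputs, a larger `chunksize` reduces the overhead of passing work between processes.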
5. Generate pretty formulas#
This step turns the generated compositions into pretty formulas, again using multiprocessing. There should be ~1.1M unique formulas.
def comp_maker(comp):
    """
    Convert a composition tuple to a pretty formula string.

    Args:
        comp (tuple): A tuple containing two sequences - elements and their amounts.

    Returns:
        str: The reduced formula of the composition as a string.
    """
    # Create a list to store elements and their amounts
    form = []

    # Iterate through elements and their amounts
    for el, amt in zip(comp[0], comp[1]):
        form.append(el)
        form.append(amt)

    # Join all elements and amounts into a single string
    form = "".join(str(e) for e in form)

    # Convert to a Composition object and get the reduced formula
    pmg_form = Composition(form).reduced_formula
    return pmg_form


# Use multiprocessing to apply comp_maker to all compositions in flat_list
with multiprocessing.Pool() as p:
    pretty_formulas = p.map(comp_maker, flat_list)

# Create a list of unique formulas
unique_pretty_formulas = list(set(pretty_formulas))

# Print the number of unique composition formulas
print(f"Number of unique composition formulas: {len(unique_pretty_formulas)}")
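Duplicates arise because different oxidation-state assignments can lead to the same formula. The sketch below illustrates this with a simplified, hypothetical `simple_formula` helper that only divides the amounts by their greatest common divisor and joins them with the symbols. pymatgen's `reduced_formula` does more (for example, it orders the elements), so this illustrates the idea and is not a replacement.

```python
from functools import reduce
from math import gcd


def simple_formula(elements, amounts):
    """Join symbols and amounts into a formula string, reduced by their GCD."""
    divisor = reduce(gcd, amounts)
    parts = []
    for el, amt in zip(elements, amounts):
        n = amt // divisor
        parts.append(el if n == 1 else f"{el}{n}")  # omit a subscript of 1
    return "".join(parts)


# Two charge-neutral assignments for the same elements:
# Cu(+1) Fe(+3) O(-2) and Cu(+2) Fe(+2) O(-2) both balance at a 1:1:2 ratio
compounds = [
    (["Cu", "Fe", "O"], (1, 1, 2)),
    (["Cu", "Fe", "O"], (1, 1, 2)),
    (["Ba", "Ti", "O"], (2, 2, 6)),
]
formulas = [simple_formula(els, amts) for els, amts in compounds]
unique_formulas = sorted(set(formulas))
print(formulas)
print(unique_formulas)
```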
6. Create DataFrame and add descriptors#
Create a DataFrame from the unique formulas and add composition-based descriptors:
# Create a DataFrame from the unique pretty formulas
new_data = pd.DataFrame(unique_pretty_formulas).rename(columns={0: "pretty_formula"})
# Remove any duplicate formulas to ensure uniqueness
new_data = new_data.drop_duplicates(subset="pretty_formula")
# Display summary statistics of the DataFrame
# This will show count, unique values, top value, and its frequency
# new_data.describe()
# Add descriptor columns
# This will take a little time as we have over 1 million rows
def add_descriptors(data):
    """
    Add composition-based descriptors to the dataframe.

    This function converts formula strings to composition objects and calculates
    various features using matminer's composition featurizers.

    Args:
        data (pd.DataFrame): DataFrame containing 'pretty_formula' column.

    Returns:
        pd.DataFrame: DataFrame with added descriptor columns.
    """
    # Convert formula strings to composition objects.
    # The returned DataFrame is assigned back because, depending on the matminer
    # version, featurize_dataframe may return a new DataFrame instead of
    # modifying `data` in place.
    str_to_comp = StrToComposition(target_col_id="composition_obj")
    data = str_to_comp.featurize_dataframe(data, col_id="pretty_formula")

    # Initialize multiple featurizers
    feature_calculators = MultipleFeaturizer(
        [
            cf.Stoichiometry(),
            cf.ElementProperty.from_preset("magpie"),
            cf.ValenceOrbital(props=["avg"]),
            cf.IonProperty(fast=True),
            cf.BandCenter(),
            cf.AtomicOrbitals(),
        ]
    )

    # Calculate features
    data = feature_calculators.featurize_dataframe(data, col_id="composition_obj")

    # If you need to use feature_labels later, uncomment the following line:
    # feature_labels = feature_calculators.feature_labels()
    return data


# Apply the function to add descriptors
new_data = add_descriptors(new_data)
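To give a sense of what a composition-based descriptor is, the sketch below computes stoichiometric p-norms by hand from element amounts. It illustrates the type of feature produced in this step and is written without matminer. The helper name `stoichiometric_norms` and the chosen p values are assumptions for the example, not matminer's code or defaults.

```python
def stoichiometric_norms(amounts, p_values=(2, 3)):
    """Return the element count and p-norms of the atomic fractions."""
    total = sum(amounts.values())
    fractions = [amt / total for amt in amounts.values()]
    features = {"n_elements": len(fractions)}
    for p in p_values:
        features[f"{p}-norm"] = sum(f**p for f in fractions) ** (1 / p)
    return features


# BaTiO3: atomic fractions 0.2, 0.2 and 0.6
features = stoichiometric_norms({"Ba": 1, "Ti": 1, "O": 3})
for name, value in features.items():
    print(name, round(value, 4))
```

Descriptors of this kind depend only on the formula, which is why they can be computed for every generated composition without any structural information.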
7. Save results to a CSV file#
# Save as .csv file
new_data.to_csv("All_oxide_comps_dataframe_featurized.csv", chunksize=10000)
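With over a million rows the output file is large, so it can be convenient to check it without loading it fully into memory. A minimal sketch using only the standard library follows. `summarise_csv` is a hypothetical helper, and the demo writes a tiny temporary file rather than reading the real output.

```python
import csv
import os
import tempfile


def summarise_csv(path):
    """Return the header row and the number of data rows, streaming the file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return header, n_rows


# Demo on a small temporary file with a similar layout
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # pandas writes an unnamed index as a first column with an empty header
        writer.writerow(["", "pretty_formula"])
        writer.writerow([0, "BaTiO3"])
        writer.writerow([1, "CuFeO2"])
    header, n_rows = summarise_csv(demo_path)

print(header, n_rows)
```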
Reproducing results#
To reproduce these results:
1. Ensure all required libraries are installed.
2. Copy and run the code snippets in order.
3. Be patient, as the process can take several hours depending on your hardware.
The final output will be a CSV file named “All_oxide_comps_dataframe_featurized.csv” containing all generated compositions with their calculated features.
Note: The exact number of compositions may vary slightly between SMACT versions, because the element and oxidation-state data used by the library are updated over time.