Looking for answers without outsourcing to an LLM
Some examples of where I get stuck on a specific programming language’s or framework’s syntax or particular way of doing things, and how I navigate it by first trying to solve it myself, based on what I already know or a good old internet search (i.e. manually looking at links and posts), learning a bit more in the process.
I know there is research now showing the impact of using LLMs on our brain, but to me, this is just how I want to do my work. I want to be involved.
1
I want to get access to the subject_id within my PyTorch dataset. I know that
it is present in a dictionary for each row of my data.
(Pdb) eval_dataset
<__main__.Cardiomegaly object at 0x00000298DF1C8BC0>
(Pdb) dir(eval_dataset)
['__add__', '__class__', '__class_getitem__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__orig_bases__', '__parameters__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'df', 'img_dir', 'transform']
(Pdb) eval_dataset[0]
(tensor([....]]), tensor([0.]), {'subject_id': '10000032', 'study_id': np.int64(50414267)})
(Pdb) type(eval_dataset[0])
<class 'tuple'>
(Pdb) eval_dataset[0][0]
tensor([[[ 0.9988, 0.5878, 0.1939, ..., -0.0972, -0.1143, -0.1486],
[ 0.9817, 0.6392, 0.2796, ..., -0.1143, -0.1314, -0.0801],
[ 1.0331, 0.6906, 0.4166, ..., 0.0569, 0.0398, 0.0056],
...,
...,
[-1.5430, -1.6650, -1.7522, ..., -1.7696, -1.7696, -1.7696],
[-1.5430, -1.6302, -1.7522, ..., -1.8044, -1.8044, -1.8044],
[-1.5430, -1.6476, -1.7347, ..., -1.8044, -1.8044, -1.8044]]])
(Pdb) eval_dataset[0][1]
tensor([0.])
(Pdb) eval_dataset[0][2]
{'subject_id': '10000032', 'study_id': np.int64(50414267)}
And there I have it, so now collecting the subject_ids is a matter of:
(Pdb) subject_ids = [row[2]["subject_id"] for row in eval_dataset]
Python makes it easy to inspect objects and dig things out of them. In other languages it may be a bit trickier, but it’s still possible: we have good old printing!
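A minimal sketch of the same pattern, using a hypothetical ToyDataset as a stand-in for the real Cardiomegaly class (which is not shown here). It only assumes what the session above shows: indexing returns an (image, label, metadata) tuple. Plain lists replace tensors so that it runs without PyTorch, and the second row is made up.

```python
class ToyDataset:
    """Hypothetical stand-in for a map-style dataset returning (image, label, metadata)."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        # Indexing past the end raises IndexError, which is what ends a plain for loop.
        image, label, meta = self.rows[idx]
        return image, label, meta


toy = ToyDataset([
    ([0.1, 0.2], 0.0, {"subject_id": "10000032", "study_id": 50414267}),
    ([0.3, 0.4], 1.0, {"subject_id": "10000033", "study_id": 50414268}),  # made-up row
])

# Tuple unpacking reads a little clearer than row[2]["subject_id"].
subject_ids = [meta["subject_id"] for _image, _label, meta in toy]
print(subject_ids)
```

Iterating like this calls __getitem__ for every row, so on the real dataset each image is presumably loaded and transformed just to read its metadata.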
2
I have an integer and I just want to run a loop that many times.
- I search for: for loop javascript
- I pick the MDN link
I preferred MDN over w3schools since I am aware of MDN’s authority. For me w3schools is similar, of course, but I just went with MDN.
3
I have a dataframe which is multi-indexed and I want to access the data for each unique combination of the index columns:
(Pdb) insurance_metrics
                                            n      tp       tn       fp      fn  accuracy  precision    recall        f1       auc    pr_auc
dataset                    insurance
eval_1_cardiomegaly.csv.gz Medicaid   14877.0  1816.0   9472.0   2897.0   692.0  0.758755   0.385317  0.724083  0.502977  0.826120  0.481929
                           Medicare   45393.0  8678.0  21264.0  13126.0  2325.0  0.659617   0.398000  0.788694  0.529033  0.772568  0.493489
...
I want to get access to the data for each unique combination of dataset and insurance:
insurance_metrics.index
MultiIndex([('eval_1_cardiomegaly.csv.gz', 'Medicaid'),
('eval_1_cardiomegaly.csv.gz', 'Medicare'),
('eval_1_cardiomegaly.csv.gz', 'No charge'),
Search query: “pandas selecting by index value”
Reading, https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-label
The .loc attribute is the primary access method. The following are valid inputs:
- A single label, e.g. 5 or ‘a’ (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).
- A list or array of labels [‘a’, ‘b’, ‘c’].
- A slice object with labels ‘a’:‘f’. Note that contrary to usual Python slices, both the start and the stop are included, when present in the index! See Slicing with labels.
- A boolean array.
My brain goes: omg.
So, I start hacking around:
(Pdb) insurance_metrics.get(('eval_1_cardiomegaly.csv.gz', 'Medicaid'))
No dice.
Okay, I know about loc, and I know that I want to pass a tuple, since I have a MultiIndex:
(Pdb) insurance_metrics.loc(('eval_1_cardiomegaly.csv.gz', 'Medicaid'))
*** ValueError: No axis named ('eval_1_cardiomegaly.csv.gz', 'Medicaid') for object type DataFrame
That fails because loc is indexed with square brackets rather than called like a function:
(Pdb) insurance_metrics.loc[("eval_1_cardiomegaly.csv.gz", "Medicare")]
n 45393.000000
tp 8678.000000
tn 21264.000000
fp 13126.000000
fn 2325.000000
accuracy 0.659617
precision 0.398000
recall 0.788694
f1 0.529033
auc 0.772568
pr_auc 0.493489
Name: (eval_1_cardiomegaly.csv.gz, Medicare), dtype: float64
Okay, so I have what I needed. To automate it, I am just gonna use groupby:
(Pdb) items=[(d, i) for d, i in insurance_metrics.groupby(by=['dataset', 'insurance'])]
(Pdb) type(items[0])
<class 'tuple'>
(Pdb) items[0][0]
('eval_1_cardiomegaly.csv.gz', 'Medicaid')
(Pdb) items[0][1]
                                            n      tp      tn      fp     fn  accuracy  precision    recall        f1      auc    pr_auc
dataset                    insurance
eval_1_cardiomegaly.csv.gz Medicaid   14877.0  1816.0  9472.0  2897.0  692.0  0.758755   0.385317  0.724083  0.502977  0.82612  0.481929
Okay, so apparently groupby works with index levels too.
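Here is a small, self-contained sketch of the lookups above. The frame is a toy stand-in for insurance_metrics: the shortened file names and the numbers are made up. It shows the tuple lookup with .loc, the groupby on index level names, and one more option, iterating over the index itself. As far as I can tell, get came back empty because DataFrame.get looks a key up among the columns, not the index.

```python
import pandas as pd

# Toy stand-in for insurance_metrics; file names and numbers are made up.
index = pd.MultiIndex.from_tuples(
    [
        ("eval_1.csv.gz", "Medicaid"),
        ("eval_1.csv.gz", "Medicare"),
        ("eval_2.csv.gz", "Medicaid"),
    ],
    names=["dataset", "insurance"],
)
metrics = pd.DataFrame({"n": [10.0, 20.0, 30.0], "recall": [0.7, 0.8, 0.6]}, index=index)

# .loc takes square brackets; a tuple addresses one (dataset, insurance) pair.
row = metrics.loc[("eval_1.csv.gz", "Medicare")]

# Option 1: groupby accepts index level names and yields (key, sub-DataFrame) pairs.
groups = {key: sub for key, sub in metrics.groupby(["dataset", "insurance"])}

# Option 2: iterate over the index and look each key up with .loc (one Series per key).
rows = {key: metrics.loc[key] for key in metrics.index}

print(row["n"], list(groups), rows[("eval_2.csv.gz", "Medicaid")]["recall"])
```

Option 2 gives a Series per combination, like the .loc output above, while groupby gives one-row DataFrames.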
4
I have two dataframes, new_results and baseline_results.
I want to copy all the rows from baseline_results that have a specific column value into new_results. In this case,
the column value is baseline_cardiomegaly.csv.gz for the dataset column.
My trial and error attempts and how I got there:
(Pdb) new_results
Unnamed: 0 y_true y_pred prob_1 ... dataset gender race insurance
0 0 1 1 0.649867 ... eval_1_cardiomegaly.csv.gz M WHITE Medicare
... ... ... ... ... ... ... ... ... ...
[142698 rows x 11 columns]
(Pdb) baseline_results
Unnamed: 0 y_true y_pred prob_1 ... dataset gender race insurance
0 0 0 0 0.057635 ... baseline_cardiomegaly.csv.gz M RECORD_NOT_FOUND RECORD_NOT_FOUND
[362281 rows x 12 columns]
Awfully bad attempts, made without even thinking, one might say: hands typing, brain listening to music (perhaps from yesterday’s muscle memory):
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results["baseline_cardiomegaly.csv.gz"]
*** KeyError: 'baseline_cardiomegaly.csv.gz'
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results.loc("baseline_cardiomegaly.csv.gz")
*** ValueError: No axis named baseline_cardiomegaly.csv.gz for object type DataFrame
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results.loc(["baseline_cardiomegaly.csv.gz"])
*** TypeError: unhashable type: 'list'
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = pd.Series(baseline_results.loc(["baseline_cardiomegaly.csv.gz"]))
*** TypeError: unhashable type: 'list'
(Pdb) baseline_results.loc(["baseline_cardiomegaly.csv.gz"])
*** TypeError: unhashable type: 'list'
(Pdb) baseline_results.loc(("baseline_cardiomegaly.csv.gz"))
*** ValueError: No axis named baseline_cardiomegaly.csv.gz for object type DataFrame
In all of the above attempts, my brain is not considering the fact that I am selecting by the value of a specific column, and that the value itself is neither an index label nor a column name.
dataset is a column, and that’s the column I must look up (not an index).
Once the brain has that updated context, I struggle a bit with the exact syntax for filtering:
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results[baseline_results[dataset == "baseline_cardiomegaly..csv.gz"]]
*** NameError: name 'dataset' is not defined
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results[baseline_results["dataset" == "baseline_cardiomegaly..csv.gz"]]
*** KeyError: False
(Pdb) new_results["baseline_cardiomegaly.csv.gz"] = baseline_results[baseline_results["dataset"] == "baseline_cardiomegaly.csv.gz"]
*** ValueError: Cannot set a DataFrame with multiple columns to the single column baseline_cardiomegaly.csv.gz
At this point I realize what I am doing wrong: I have the selection correct, but I am trying to assign multiple columns to a single column.
So I need concat, and again I struggle with the right syntax:
(Pdb) new_results = new_results.concat(baseline_results[baseline_results["dataset"] == "baseline_cardiomegaly.csv.gz"])
*** AttributeError: 'DataFrame' object has no attribute 'concat'
(Pdb) new_results = pd.concat(new_results, baseline_results[baseline_results["dataset"] == "baseline_cardiomegaly.csv.gz"])
*** TypeError: concat() takes 1 positional argument but 2 were given
(Pdb) new_results = pd.concat([baseline_results[baseline_results["dataset"] == "baseline_cardiomegaly.csv.gz"], new_results])
Okay, finally I have it! Trial and error is my favorite way to learn: the brain needs to walk the paths to the solution, and there is a satisfaction I derive from that process:
(Pdb) new_results
Unnamed: 0 y_true y_pred prob_1 ... dataset gender race insurance
0 0 0 0 0.057635 ... baseline_cardiomegaly.csv.gz M RECORD_NOT_FOUND RECORD_NOT_FOUND
[362281 rows x 12 columns]
Okay, so I am training myself, I think. I am my favorite agent.
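The two steps that finally worked, a boolean mask on the dataset column followed by pd.concat on a list of frames, can be sketched on toy data like this. The frames, file names and the extra column are made up, and ignore_index=True is my own addition to get a fresh 0..n-1 index (the session above did not use it).

```python
import pandas as pd

# Toy stand-ins for baseline_results and new_results; all values are made up.
baseline = pd.DataFrame(
    {
        "y_true": [0, 1, 1],
        "dataset": ["baseline.csv.gz", "other.csv.gz", "baseline.csv.gz"],
        "extra": ["a", "b", "c"],  # a column that only the baseline frame has
    }
)
new = pd.DataFrame({"y_true": [1], "dataset": ["eval_1.csv.gz"]})

# Step 1: a boolean mask picks rows by the value of a column.
mask = baseline["dataset"] == "baseline.csv.gz"

# Step 2: pd.concat takes ONE argument, a list of frames, and stacks them row-wise.
combined = pd.concat([baseline[mask], new], ignore_index=True)

print(combined)
```

Columns that exist in only one of the frames are kept and filled with NaN for the other frame's rows.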
5
I want the equivalent of the ternary operator ?: (in C or JavaScript) in Python. I have forgotten exactly how, but I know that
I can use if..else in a list comprehension, so perhaps it works outside it too?
Let’s try:
>>> import random
>>> value = 1 if random.random() > 0.5 else 0
>>> value
0
>>> value = 1 if random.random() > 0.5 else 0
>>> value
0
>>> value = 1 if random.random() > 0.5 else 0
>>> value
1
Yeah, it does.
Of course, random.random() > 0.5 is just an example of a condition.
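A deterministic version of the same check, as a small sketch. The parity function is a made-up example. The thing to remember is the order: value_if_true if condition else value_if_false, with the condition in the middle, unlike C's condition ? a : b. It is an ordinary expression, so it works both inside and outside a comprehension.

```python
def parity(n):
    # value_if_true if condition else value_if_false
    return "even" if n % 2 == 0 else "odd"


# Outside a comprehension...
standalone = parity(7)

# ...and inside one, it is the same expression.
in_comprehension = ["even" if n % 2 == 0 else "odd" for n in range(4)]

print(standalone, in_comprehension)
```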
6
I want to access the feature-prep directory, which is at the same level as utils, from a file inside utils. Basically:
- Traverse one directory up from utils
- Go down into another directory, feature-prep
(Pdb) Path(__file__) / "..//feature-prep"
WindowsPath('C:/Users/amits/work/github.com/amitsaha/ml-fairness-health/mywork/experiments/mimic-cxr/utils/common_experiment.py/../feature-prep')
(Pdb) import os
(Pdb) os.path.exists(Path(__file__) / "..//feature-prep")
False
(Pdb) os.path.exists(Path(__file__))
True
(Pdb) os.path.basename(Path(__file__))
'common_experiment.py'
(Pdb) os.path.dirname(Path(__file__))
'C:\\Users\\amits\\work\\github.com\\amitsaha\\ml-fairness-health\\mywork\\experiments\\mimic-cxr\\utils'
(Pdb) os.path.dirname(os.path.dirname(Path(__file__)))
'C:\\Users\\amits\\work\\github.com\\amitsaha\\ml-fairness-health\\mywork\\experiments\\mimic-cxr'
(Pdb) os.path.exists(os.path.dirname(os.path.dirname(Path(__file__)))
*** SyntaxError: '(' was never closed
(Pdb) os.path.exists(os.path.dirname(os.path.dirname(Path(__file__))))
True
(Pdb) os.path.exists(Path(os.path.dirname(os.path.dirname(Path(__file__)))) / "feature-prep")
True
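The same lookup can be written with pathlib alone. A sketch, where sibling_dir is a hypothetical helper name and the directory layout is recreated in a temporary folder so that the example runs anywhere. The first attempt above most likely returned False because Path(__file__) points at the file itself, so a single ".." only gets as far as utils; it takes two steps up to reach the parent of utils.

```python
import tempfile
from pathlib import Path


def sibling_dir(file_path, name):
    """Return the directory `name` sitting next to the directory that contains file_path."""
    # .parent       -> the directory holding the file (utils)
    # .parent again -> the directory holding utils
    return Path(file_path).resolve().parent.parent / name


# Recreate the layout from above in a throwaway directory:
#   <root>/utils/common_experiment.py
#   <root>/feature-prep/
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    (root / "utils").mkdir()
    (root / "feature-prep").mkdir()
    script = root / "utils" / "common_experiment.py"
    script.touch()

    target = sibling_dir(script, "feature-prep")
    found = target.is_dir()

print(target.name, found)
```

Inside the real file this would be called as sibling_dir(__file__, "feature-prep").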
7
> const userQueries = ["foo", "bar"];
undefined
> console.log(userQueries)
[ 'foo', 'bar' ]
undefined
> for (const item in userQueries) {
... console.log(item)
... }
0
1
undefined
> for (const idx, item in userQueries) {
for (const idx, item in userQueries) {
^^^
Uncaught SyntaxError: Missing initializer in const declaration
> for (const {idx, item} in userQueries) {
... console.log(item)
... {
... }
... }
undefined
undefined
undefined
> for (const item in Object.entries(userQueries)) {
... console.log(item)
... }
0
1
undefined
Then I come across https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration#for...of_statement when I search again, and it is in the Google AI summary too.
So, we have a for..of !? (why???)
> for (const item of userQueries) {
... console.log(item)
... }
foo
bar
undefined
The why, as far as I can tell: for...in iterates over the property names of an object (for an array, the indices), while for...of iterates over the values of an iterable. To get both the index and the item, for (const [idx, item] of userQueries.entries()) works.
8
I needed to upgrade some projects from Tailwind v3 to v4.
I used GitHub Copilot to see if I could get it to do it. It took me on a wild ride, with no luck, and it appeared to get stuck.
Then it hit me: let’s try https://tailwindcss.com/docs/upgrade-guide, and the command there worked like a charm.
9
I wanted to find the rows which have the minimum and maximum values for recall in this dataframe:
(Pdb) current_group_results
                        n    tp     tn    fp    fn  accuracy  precision    recall        f1       auc    pr_auc
race
AMERICAN INDIAN       4.0   0.0    1.0   0.0   3.0  0.250000   0.000000  0.000000  0.000000  0.333333  0.805556
ASIAN                27.0   1.0   21.0   1.0   4.0  0.814815   0.500000  0.200000  0.285714  0.754545  0.379762
BLACK               145.0   4.0  113.0  11.0  17.0  0.806897   0.266667  0.190476  0.222222  0.763441  0.299849
HISPANIC             66.0   3.0   55.0   1.0   7.0  0.878788   0.750000  0.300000  0.428571  0.757143  0.430238
OTHER                81.0   9.0   55.0   4.0  13.0  0.790123   0.692308  0.409091  0.514286  0.822804  0.660273
RECORD_NOT_PRESENT  121.0   2.0  111.0   2.0   6.0  0.933884   0.500000  0.250000  0.333333  0.853982  0.417907
WHITE               536.0  31.0  398.0  16.0  91.0  0.800373   0.659574  0.254098  0.366864  0.825097  0.581051
(Pdb) type(current_group_results)
<class 'pandas.core.frame.DataFrame'>
First attempt, just to check if I remembered it correctly:
(Pdb) current_group_results.minidx
*** AttributeError: 'DataFrame' object has no attribute 'minidx'
I had a vague recollection that there is a method named something like that, so I do a dir:
(Pdb) dir(current_group_results)
['...
'idxmax', 'idxmin', ...]
Once I spotted those, I was curious whether they would also work on a Series:
(Pdb) type(current_group_results["recall"])
<class 'pandas.core.series.Series'>
Let’s try it:
(Pdb) current_group_results["recall"]
AMERICAN INDIAN 0.000000
ASIAN 0.200000
BLACK 0.190476
HISPANIC 0.300000
OTHER 0.409091
RECORD_NOT_PRESENT 0.250000
WHITE 0.254098
Name: recall, dtype: float64
(Pdb) current_group_results["recall"].idxmax()
'OTHER'
(Pdb) current_group_results["recall"].idxmin()
'AMERICAN INDIAN'
We can get the max and min values as well:
(Pdb) current_group_results["recall"]["OTHER"]
np.float64(0.4090909090909091)
(Pdb) current_group_results["recall"]["AMERICAN INDIAN"]
np.float64(0.0)
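Putting the pieces together in a small sketch: idxmin and idxmax give the index labels, and .loc with those labels gives the full rows rather than just the recall value. The frame below reuses three rows (n and recall only) from the table above.

```python
import pandas as pd

# Three rows from the table above, keeping only the n and recall columns.
results = pd.DataFrame(
    {"n": [4.0, 27.0, 81.0], "recall": [0.0, 0.2, 0.409091]},
    index=pd.Index(["AMERICAN INDIAN", "ASIAN", "OTHER"], name="race"),
)

# idxmin/idxmax return the index LABEL of the smallest/largest value.
worst = results["recall"].idxmin()
best = results["recall"].idxmax()

# .loc with that label returns the whole row as a Series.
worst_row = results.loc[worst]
best_row = results.loc[best]

# If only the values are needed, min() and max() skip the label lookup.
recall_range = (results["recall"].min(), results["recall"].max())

print(worst, best, recall_range)
```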
10
I want to filter the rows of a pandas dataframe by combining two filters/conditions. I have done this many times, but every time, without giving myself a chance, I have had to search or ask an LLM for the exact syntax.
This time, I decided not to outsource.
I got it by brute force, and now I know I will likely remember it:
(Pdb) retraining_dataset[retraining_dataset[group] == min_recall_category]
Unnamed: 0 dicom_id subject_id study_id Cardiomegaly split race gender insurance
..
[704 rows x 9 columns]
(Pdb) retraining_dataset[retraining_dataset[group] == min_recall_category && retraining_dataset[Cardiomegaly]]
*** SyntaxError: invalid syntax
(Pdb) retraining_dataset[retraining_dataset[group] == min_recall_category & retraining_dataset[Cardiomegaly]]
*** NameError: name 'Cardiomegaly' is not defined
(Pdb) retraining_dataset[retraining_dataset[group] == min_recall_category & retraining_dataset['Cardiomegaly']]
*** TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]
(Pdb) retraining_dataset[(retraining_dataset[group] == min_recall_category) && (retraining_dataset['Cardiomegaly'])]
*** SyntaxError: invalid syntax
(Pdb) retraining_dataset[(retraining_dataset[group] == min_recall_category) && (retraining_dataset['Cardiomegaly']==True)]
*** SyntaxError: invalid syntax
(Pdb) retraining_dataset[(retraining_dataset[group] == min_recall_category) & (retraining_dataset['Cardiomegaly']==True)]
WORKS
So, the key is:
- Individual conditions in parentheses
- Combine them with & for and (or | for or)
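The rule above, sketched on a toy frame. The data is made up, and group and min_recall_category are given example values here. The parentheses matter because & binds more tightly than == in Python: without them, min_recall_category & retraining_dataset['Cardiomegaly'] is evaluated first, which is most likely what the TypeError above was complaining about. Python also has no && operator, hence the SyntaxError.

```python
import pandas as pd

# Toy stand-in for retraining_dataset; the rows are made up.
df = pd.DataFrame(
    {
        "race": ["ASIAN", "WHITE", "ASIAN", "BLACK", "WHITE"],
        "Cardiomegaly": [1.0, 1.0, 0.0, 1.0, 0.0],
    }
)
group = "race"                  # example value
min_recall_category = "ASIAN"   # example value

in_group = df[group] == min_recall_category
positive = df["Cardiomegaly"] == 1.0

# and: both conditions must hold (each one wrapped in parentheses).
both = df[(df[group] == min_recall_category) & (df["Cardiomegaly"] == 1.0)]

# or: at least one condition holds.
either = df[in_group | positive]

# not: ~ negates a boolean mask.
neither = df[~(in_group | positive)]

print(len(both), len(either), len(neither))
```

Naming the masks first (in_group, positive) sidesteps the precedence problem altogether, since each comparison is already evaluated before & or | is applied.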